Search Results: "tats"

23 March 2023

Dirk Eddelbuettel: RcppSMC 0.2.7 on CRAN: Extensions and Update

A new release 0.2.7 of our RcppSMC package arrived at CRAN earlier today. It contains several extensions added by team member (and former GSoC student) Ilya Zarubin since the last release. We were a little slow to release those but one of those CRAN emails forced our hand for a release now. The updated uninitialized variable messages in clang++-16 have found a fan in Brian Ripley, and so he sent us a note. And as the issue was trivially reproducible with clang++-15 here too I had it fixed in no time. And both changes taken together form the incremental 0.2.7 release. RcppSMC provides Rcpp-based bindings to R for the Sequential Monte Carlo Template Classes (SMCTC) by Adam Johansen described in his JSS article. Sequential Monte Carlo is also referred to as Particle Filter in some contexts. The package now also features the Google Summer of Code work by Leah South in 2017, and by Ilya Zarubin in 2021. The release is summarized below.

Changes in RcppSMC version 0.2.7 (2023-03-22)
  • Extensive extensions for conditional SMC and resample, updated hello_world example, added skeleton function for easier package creation (Ilya in #67,#72)
  • Small package updates (Dirk in #75 fixing #74)

Courtesy of my CRANberries, there is a diffstat report for this release. More information is on the RcppSMC page. Issues and bugreports should go to the GitHub issue tracker. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

13 March 2023

Antoine Beaupr : Framework 12th gen laptop review

The Framework is a 13.5" laptop body with swappable parts, which makes it somewhat future-proof and certainly easily repairable, scoring an "exceedingly rare" 10/10 score from ifixit.com. There are two generations of the laptop's main board (both compatible with the same body): the Intel 11th and 12th gen chipsets. I have received my Framework, 12th generation "DIY", device in late September 2022 and will update this page as I go along in the process of ordering, burning-in, setting up and using the device over the years. Overall, the Framework is a good laptop. I like the keyboard, the touch pad, the expansion cards. Clearly there's been some good work done on industrial design, and it's the most repairable laptop I've had in years. Time will tell, but it looks sturdy enough to survive me many years as well. This is also one of the most powerful devices I ever lay my hands on. I have managed, remotely, more powerful servers, but this is the fastest computer I have ever owned, and it fits in this tiny case. It is an amazing machine. On the downside, there's a bit of proprietary firmware required (WiFi, Bluetooth, some graphics) and the Framework ships with a proprietary BIOS, with currently no Coreboot support. Expect to need the latest kernel, firmware, and hacking around a bunch of things to get resolution and keybindings working right. Like others, I have first found significant power management issues, but many issues can actually be solved with some configuration. Some of the expansion ports (HDMI, DP, MicroSD, and SSD) use power when idle, so don't expect week-long suspend, or "full day" battery while those are plugged in. Finally, the expansion ports are nice, but there's only four of them. If you plan to have a two-monitor setup, you're likely going to need a dock. Read on for the detailed review. For context, I'm moving from the Purism Librem 13v4 because it basically exploded on me. I had, in the meantime, reverted back to an old ThinkPad X220, so I sometimes compare the Framework with that venerable laptop as well. This blog post has been maturing for months now. It started in September 2022 and I declared it completed in March 2023. It's the longest single article on this entire website, currently clocking at about 13,000 words. It will take an average reader a full hour to go through this thing, so I don't expect anyone to actually do that. This introduction should be good enough for most people, read the first section if you intend to actually buy a Framework. Jump around the table of contents as you see fit for after you did buy the laptop, as it might include some crucial hints on how to make it work best for you, especially on (Debian) Linux.

Advice for buyers Those are things I wish I would have known before buying:
  1. consider buying 4 USB-C expansion cards, or at least a mix of 4 USB-A or USB-C cards, as they use less power than other cards and you do want to fill those expansion slots otherwise they snag around and feel insecure
  2. you will likely need a dock or at least a USB hub if you want a two-monitor setup, otherwise you'll run out of ports
  3. you have to do some serious tuning to get proper (10h+ idle, 10 days suspend) power savings
  4. in particular, beware that the HDMI, DisplayPort and particularly the SSD and MicroSD cards take a significant amount power, even when sleeping, up to 2-6W for the latter two
  5. beware that the MicroSD card is what it says: Micro, normal SD cards won't fit, and while there might be full sized one eventually, it's currently only at the prototyping stage
  6. the Framework monitor has an unusual aspect ratio (3:2): I like it (and it matches classic and digital photography aspect ratio), but it might surprise you

Current status I have the framework! It's setup with a fresh new Debian bookworm installation. I've ran through a large number of tests and burn in. I have decided to use the Framework as my daily driver, and had to buy a USB-C dock to get my two monitors connected, which was own adventure. Update: Framework just (2023-03-23) just announced a whole bunch of new stuff: The recording is available in this video and it's not your typical keynote. It starts ~25 minutes late, audio is crap, lightning and camera are crap, clapping seems to be from whatever staff they managed to get together in a room, decor is bizarre, colors are shit. It's amazing.

Specifications Those are the specifications of the 12th gen, in general terms. Your build will of course vary according to your needs.
  • CPU: i5-1240P, i7-1260P, or i7-1280P (Up to 4.4-4.8 GHz, 4+8 cores), Iris Xe graphics
  • Storage: 250-4000GB NVMe (or bring your own)
  • Memory: 8-64GB DDR4-3200 (or bring your own)
  • WiFi 6e (AX210, vPro optional, or bring your own)
  • 296.63mm X 228.98mm X 15.85mm, 1.3Kg
  • 13.5" display, 3:2 ratio, 2256px X 1504px, 100% sRGB, >400 nit
  • 4 x USB-C user-selectable expansion ports, including
    • USB-C
    • USB-A
    • HDMI
    • DP
    • Ethernet
    • MicroSD
    • 250-1000GB SSD
  • 3.5mm combo headphone jack
  • Kill switches for microphone and camera
  • Battery: 55Wh
  • Camera: 1080p 60fps
  • Biometrics: Fingerprint Reader
  • Backlit keyboard
  • Power Adapter: 60W USB-C (or bring your own)
  • ships with a screwdriver/spludger
  • 1 year warranty
  • base price: 1000$CAD, but doesn't give you much, typical builds around 1500-2000$CAD

Actual build This is the actual build I ordered. Amounts in CAD. (1CAD = ~0.75EUR/USD.)

Base configuration
  • CPU: Intel Core i5-1240P (AKA Alder Lake P 8 4.4GHz P-threads, 8 3.2GHz E-threads, 16 total, 28-64W), 1079$
  • Memory: 16GB (1 x 16GB) DDR4-3200, 104$

Customization
  • Keyboard: US English, included

Expansion Cards
  • 2 USB-C $24
  • 3 USB-A $36
  • 2 HDMI $50
  • 1 DP $50
  • 1 MicroSD $25
  • 1 Storage 1TB $199
  • Sub-total: 384$

Accessories
  • Power Adapter - US/Canada $64.00

Total
  • Before tax: 1606$
  • After tax and duties: 1847$
  • Free shipping

Quick evaluation This is basically the TL;DR: here, just focusing on broad pros/cons of the laptop.

Pros

Cons
  • the 11th gen is out of stock, except for the higher-end CPUs, which are much less affordable (700$+)
  • the 12th gen has compatibility issues with Debian, followup in the DebianOn page, but basically: brightness hotkeys, power management, wifi, the webcam is okay even though the chipset is the infamous alder lake because it does not have the fancy camera; most issues currently seem solvable, and upstream is working with mainline to get their shit working
  • 12th gen might have issues with thunderbolt docks
  • they used to have some difficulty keeping up with the orders: first two batches shipped, third batch sold out, fourth batch should have shipped (?) in October 2021. they generally seem to keep up with shipping. update (august 2022): they rolled out a second line of laptops (12th gen), first batch shipped, second batch shipped late, September 2022 batch was generally on time, see this spreadsheet for a crowdsourced effort to track those supply chain issues seem to be under control as of early 2023. I got the Ethernet expansion card shipped within a week.
  • compared to my previous laptop (Purism Librem 13v4), it feels strangely bulkier and heavier; it's actually lighter than the purism (1.3kg vs 1.4kg) and thinner (15.85mm vs 18mm) but the design of the Purism laptop (tapered edges) makes it feel thinner
  • no space for a 2.5" drive
  • rather bright LED around power button, but can be dimmed in the BIOS (not low enough to my taste) I got used to it
  • fan quiet when idle, but can be noisy when running, for example if you max a CPU for a while
  • battery described as "mediocre" by Ars Technica (above), confirmed poor in my tests (see below)
  • no RJ-45 port, and attempts at designing ones are failing because the modular plugs are too thin to fit (according to Linux After Dark), so unlikely to have one in the future Update: they cracked that nut and ship an 2.5 gbps Ethernet expansion card with a realtek chipset, without any firmware blob (!)
  • a bit pricey for the performance, especially when compared to the competition (e.g. Dell XPS, Apple M1)
  • 12th gen Intel has glitchy graphics, seems like Intel hasn't fully landed proper Linux support for that chipset yet

Initial hardware setup A breeze.

Accessing the board The internals are accessed through five TorX screws, but there's a nice screwdriver/spudger that works well enough. The screws actually hold in place so you can't even lose them. The first setup is a bit counter-intuitive coming from the Librem laptop, as I expected the back cover to lift and give me access to the internals. But instead the screws is release the keyboard and touch pad assembly, so you actually need to flip the laptop back upright and lift the assembly off (!) to get access to the internals. Kind of scary. I also actually unplugged a connector in lifting the assembly because I lifted it towards the monitor, while you actually need to lift it to the right. Thankfully, the connector didn't break, it just snapped off and I could plug it back in, no harm done. Once there, everything is well indicated, with QR codes all over the place supposedly leading to online instructions.

Bad QR codes Unfortunately, the QR codes I tested (in the expansion card slot, the memory slot and CPU slots) did not actually work so I wonder how useful those actually are. After all, they need to point to something and that means a URL, a running website that will answer those requests forever. I bet those will break sooner than later and in fact, as far as I can tell, they just don't work at all. I prefer the approach taken by the MNT reform here which designed (with the 100 rabbits folks) an actual paper handbook (PDF). The first QR code that's immediately visible from the back of the laptop, in an expansion cord slot, is a 404. It seems to be some serial number URL, but I can't actually tell because, well, the page is a 404. I was expecting that bar code to lead me to an introduction page, something like "how to setup your Framework laptop". Support actually confirmed that it should point a quickstart guide. But in a bizarre twist, they somehow sent me the URL with the plus (+) signs escaped, like this:
https://guides.frame.work/Guide/Framework\+Laptop\+DIY\+Edition\+Quick\+Start\+Guide/57
... which Firefox immediately transforms in:
https://guides.frame.work/Guide/Framework/+Laptop/+DIY/+Edition/+Quick/+Start/+Guide/57
I'm puzzled as to why they would send the URL that way, the proper URL is of course:
https://guides.frame.work/Guide/Framework+Laptop+DIY+Edition+Quick+Start+Guide/57
(They have also "let the team know about this for feedback and help resolve the problem with the link" which is a support code word for "ha-ha! nope! not my problem right now!" Trust me, I know, my own code word is "can you please make a ticket?")

Seating disks and memory The "DIY" kit doesn't actually have that much of a setup. If you bought RAM, it's shipped outside the laptop in a little plastic case, so you just seat it in as usual. Then you insert your NVMe drive, and, if that's your fancy, you also install your own mPCI WiFi card. If you ordered one (which was my case), it's pre-installed. Closing the laptop is also kind of amazing, because the keyboard assembly snaps into place with magnets. I have actually used the laptop with the keyboard unscrewed as I was putting the drives in and out, and it actually works fine (and will probably void your warranty, so don't do that). (But you can.) (But don't, really.)

Hardware review

Keyboard and touch pad The keyboard feels nice, for a laptop. I'm used to mechanical keyboard and I'm rather violent with those poor things. Yet the key travel is nice and it's clickety enough that I don't feel too disoriented. At first, I felt the keyboard as being more laggy than my normal workstation setup, but it turned out this was a graphics driver issues. After enabling a composition manager, everything feels snappy. The touch pad feels good. The double-finger scroll works well enough, and I don't have to wonder too much where the middle button is, it just works. Taps don't work, out of the box: that needs to be enabled in Xorg, with something like this:
cat > /etc/X11/xorg.conf.d/40-libinput.conf <<EOF
Section "InputClass"
      Identifier "libinput touch pad catchall"
      MatchIsTouchpad "on"
      MatchDevicePath "/dev/input/event*"
      Driver "libinput"
      Option "Tapping" "on"
      Option "TappingButtonMap" "lmr"
EndSection
EOF
But be aware that once you enable that tapping, you'll need to deal with palm detection... So I have not actually enabled this in the end.

Power button The power button is a little dangerous. It's quite easy to hit, as it's right next to one expansion card where you are likely to plug in a cable power. And because the expansion cards are kind of hard to remove, you might squeeze the laptop (and the power key) when trying to remove the expansion card next to the power button. So obviously, don't do that. But that's not very helpful. An alternative is to make the power button do something else. With systemd-managed systems, it's actually quite easy. Add a HandlePowerKey stanza to (say) /etc/systemd/logind.conf.d/power-suspends.conf:
[Login]
HandlePowerKey=suspend
HandlePowerKeyLongPress=poweroff
You might have to create the directory first:
mkdir /etc/systemd/logind.conf.d/
Then restart logind:
systemctl restart systemd-logind
And the power button will suspend! Long-press to power off doesn't actually work as the laptop immediately suspends... Note that there's probably half a dozen other ways of doing this, see this, this, or that.

Special keybindings There is a series of "hidden" (as in: not labeled on the key) keybindings related to the fn keybinding that I actually find quite useful.
Key Equivalent Effect Command
p Pause lock screen xset s activate
b Break ? ?
k ScrLk switch keyboard layout N/A
It looks like those are defined in the microcontroller so it would be possible to add some. For example, the SysRq key is almost bound to fn s in there. Note that most other shortcuts like this are clearly documented (volume, brightness, etc). One key that's less obvious is F12 that only has the Framework logo on it. That actually calls the keysym XF86AudioMedia which, interestingly, does absolutely nothing here. By default, on Windows, it opens your browser to the Framework website and, on Linux, your "default media player". The keyboard backlight can be cycled with fn-space. The dimmer version is dim enough, and the keybinding is easy to find in the dark. A skinny elephant would be performed with alt PrtScr (above F11) KEY, so for example alt fn F11 b should do a hard reset. This comment suggests you need to hold the fn only if "function lock" is on, but that's actually the opposite of my experience. Out of the box, some of the fn keys don't work. Mute, volume up/down, brightness, monitor changes, and the airplane mode key all do basically nothing. They don't send proper keysyms to Xorg at all. This is a known problem and it's related to the fact that the laptop has light sensors to adjust the brightness automatically. Somehow some of those keys (e.g. the brightness controls) are supposed to show up as a different input device, but don't seem to work correctly. It seems like the solution is for the Framework team to write a driver specifically for this, but so far no progress since July 2022. In the meantime, the fancy functionality can be supposedly disabled with:
echo 'blacklist hid_sensor_hub'   sudo tee /etc/modprobe.d/framework-als-blacklist.conf
... and a reboot. This solution is also documented in the upstream guide. Note that there's another solution flying around that fixes this by changing permissions on the input device but I haven't tested that or seen confirmation it works.

Kill switches The Framework has two "kill switches": one for the camera and the other for the microphone. The camera one actually disconnects the USB device when turned off, and the mic one seems to cut the circuit. It doesn't show up as muted, it just stops feeding the sound. Both kill switches are around the main camera, on top of the monitor, and quite discreet. Then turn "red" when enabled (i.e. "red" means "turned off").

Monitor The monitor looks pretty good to my untrained eyes. I have yet to do photography work on it, but some photos I looked at look sharp and the colors are bright and lively. The blacks are dark and the screen is bright. I have yet to use it in full sunlight. The dimmed light is very dim, which I like.

Screen backlight I bind brightness keys to xbacklight in i3, but out of the box I get this error:
sep 29 22:09:14 angela i3[5661]: No outputs have backlight property
It just requires this blob in /etc/X11/xorg.conf.d/backlight.conf:
Section "Device"
    Identifier  "Card0"
    Driver      "intel"
    Option      "Backlight"  "intel_backlight"
EndSection
This way I can control the actual backlight power with the brightness keys, and they do significantly reduce power usage.

Multiple monitor support I have been able to hook up my two old monitors to the HDMI and DisplayPort expansion cards on the laptop. The lid closes without suspending the machine, and everything works great. I actually run out of ports, even with a 4-port USB-A hub, which gives me a total of 7 ports:
  1. power (USB-C)
  2. monitor 1 (DisplayPort)
  3. monitor 2 (HDMI)
  4. USB-A hub, which adds:
  5. keyboard (USB-A)
  6. mouse (USB-A)
  7. Yubikey
  8. external sound card
Now the latter, I might be able to get rid of if I switch to a combo-jack headset, which I do have (and still need to test). But still, this is a problem. I'll probably need a powered USB-C dock and better monitors, possibly with some Thunderbolt chaining, to save yet more ports. But that means more money into this setup, argh. And figuring out my monitor situation is the kind of thing I'm not that big of a fan of. And neither is shopping for USB-C (or is it Thunderbolt?) hubs. My normal autorandr setup doesn't work: I have tried saving a profile and it doesn't get autodetected, so I also first need to do:
autorandr -l framework-external-dual-lg-acer
The magic:
autorandr -l horizontal
... also works well. The worst problem with those monitors right now is that they have a radically smaller resolution than the main screen on the laptop, which means I need to reset the font scaling to normal every time I switch back and forth between those monitors and the laptop, which means I actually need to do this:
autorandr -l horizontal &&
eho Xft.dpi: 96   xrdb -merge &&
systemctl restart terminal xcolortaillog background-image emacs &&
i3-msg restart
Kind of disruptive.

Expansion ports I ordered a total of 10 expansion ports. I did manage to initialize the 1TB drive as an encrypted storage, mostly to keep photos as this is something that takes a massive amount of space (500GB and counting) and that I (unfortunately) don't work on very often (but still carry around). The expansion ports are fancy and nice, but not actually that convenient. They're a bit hard to take out: you really need to crimp your fingernails on there and pull hard to take them out. There's a little button next to them to release, I think, but at first it feels a little scary to pull those pucks out of there. You get used to it though, and it's one of those things you can do without looking eventually. There's only four expansion ports. Once you have two monitors, the drive, and power plugged in, bam, you're out of ports; there's nowhere to plug my Yubikey. So if this is going to be my daily driver, with a dual monitor setup, I will need a dock, which means more crap firmware and uncertainty, which isn't great. There are actually plans to make a dual-USB card, but that is blocked on designing an actual board for this. I can't wait to see more expansion ports produced. There's a ethernet expansion card which quickly went out of stock basically the day it was announced, but was eventually restocked. I would like to see a proper SD-card reader. There's a MicroSD card reader, but that obviously doesn't work for normal SD cards, which would be more broadly compatible anyways (because you can have a MicroSD to SD card adapter, but I have never heard of the reverse). Someone actually found a SD card reader that fits and then someone else managed to cram it in a 3D printed case, which is kind of amazing. Still, I really like that idea that I can carry all those little adapters in a pouch when I travel and can basically do anything I want. It does mean I need to shuffle through them to find the right one which is a little annoying. I have an elastic band to keep them lined up so that all the ports show the same side, to make it easier to find the right one. But that quickly gets undone and instead I have a pouch full of expansion cards. Another awesome thing with the expansion cards is that they don't just work on the laptop: anything that takes USB-C can take those cards, which means you can use it to connect an SD card to your phone, for backups, for example. Heck, you could even connect an external display to your phone that way, assuming that's supported by your phone of course (and it probably isn't). The expansion ports do take up some power, even when idle. See the power management section below, and particularly the power usage tests for details.

USB-C charging One thing that is really a game changer for me is USB-C charging. It's hard to overstate how convenient this is. I often have a USB-C cable lying around to charge my phone, and I can just grab that thing and pop it in my laptop. And while it will obviously not charge as fast as the provided charger, it will stop draining the battery at least. (As I wrote this, I had the laptop plugged in the Samsung charger that came with a phone, and it was telling me it would take 6 hours to charge the remaining 15%. With the provided charger, that flew down to 15 minutes. Similarly, I can power the laptop from the power grommet on my desk, reducing clutter as I have that single wire out there instead of the bulky power adapter.) I also really like the idea that I can charge my laptop with a power bank or, heck, with my phone, if push comes to shove. (And vice-versa!) This is awesome. And it works from any of the expansion ports, of course. There's a little led next to the expansion ports as well, which indicate the charge status:
  • red/amber: charging
  • white: charged
  • off: unplugged
I couldn't find documentation about this, but the forum answered. This is something of a recurring theme with the Framework. While it has a good knowledge base and repair/setup guides (and the forum is awesome) but it doesn't have a good "owner manual" that shows you the different parts of the laptop and what they do. Again, something the MNT reform did well. Another thing that people are asking about is an external sleep indicator: because the power LED is on the main keyboard assembly, you don't actually see whether the device is active or not when the lid is closed. Finally, I wondered what happens when you plug in multiple power sources and it turns out the charge controller is actually pretty smart: it will pick the best power source and use it. The only downside is it can't use multiple power sources, but that seems like a bit much to ask.

Multimedia and other devices Those things also work:
  • webcam: splendid, best webcam I've ever had (but my standards are really low)
  • onboard mic: works well, good gain (maybe a bit much)
  • onboard speakers: sound okay, a little metal-ish, loud enough to be annoying, see this thread for benchmarks, apparently pretty good speakers
  • combo jack: works, with slight hiss, see below
There's also a light sensor, but it conflicts with the keyboard brightness controls (see above). There's also an accelerometer, but it's off by default and will be removed from future builds.

Combo jack mic tests The Framework laptop ships with a combo jack on the left side, which allows you to plug in a CTIA (source) headset. In human terms, it's a device that has both a stereo output and a mono input, typically a headset or ear buds with a microphone somewhere. It works, which is better than the Purism (which only had audio out), but is on par for the course for that kind of onboard hardware. Because of electrical interference, such sound cards very often get lots of noise from the board. With a Jabra Evolve 40, the built-in USB sound card generates basically zero noise on silence (invisible down to -60dB in Audacity) while plugging it in directly generates a solid -30dB hiss. There is a noise-reduction system in that sound card, but the difference is still quite striking. On a comparable setup (curie, a 2017 Intel NUC), there is also a his with the Jabra headset, but it's quieter, more in the order of -40/-50 dB, a noticeable difference. Interestingly, testing with my Mee Audio Pro M6 earbuds leads to a little more hiss on curie, more on the -35/-40 dB range, close to the Framework. Also note that another sound card, the Antlion USB adapter that comes with the ModMic 4, also gives me pretty close to silence on a quiet recording, picking up less than -50dB of background noise. It's actually probably picking up the fans in the office, which do make audible noises. In other words, the hiss of the sound card built in the Framework laptop is so loud that it makes more noise than the quiet fans in the office. Or, another way to put it is that two USB sound cards (the Jabra and the Antlion) are able to pick up ambient noise in my office but not the Framework laptop. See also my audio page.

Performance tests

Compiling Linux 5.19.11 On a single core, compiling the Debian version of the Linux kernel takes around 100 minutes:
5411.85user 673.33system 1:37:46elapsed 103%CPU (0avgtext+0avgdata 831700maxresident)k
10594704inputs+87448000outputs (9131major+410636783minor)pagefaults 0swaps
This was using 16 watts of power, with full screen brightness. With all 16 cores (make -j16), it takes less than 25 minutes:
19251.06user 2467.47system 24:13.07elapsed 1494%CPU (0avgtext+0avgdata 831676maxresident)k
8321856inputs+87427848outputs (30792major+409145263minor)pagefaults 0swaps
I had to plug the normal power supply after a few minutes because battery would actually run out using my desk's power grommet (34 watts). During compilation, fans were spinning really hard, quite noisy, but not painfully so. The laptop was sucking 55 watts of power, steadily:
  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
 Average  87.9   0.0  10.7   1.4   0.1 17.8 6583.6 5054.3 233.0 223.9 233.1  55.96
 GeoMean  87.9   0.0  10.6   1.2   0.0 17.6 6427.8 5048.1 227.6 218.7 227.7  55.96
  StdDev   1.4   0.0   1.2   0.6   0.2  3.0 1436.8  255.5 50.0 47.5 49.7   0.20
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
 Minimum  85.0   0.0   7.8   0.5   0.0 13.0 3594.0 4638.0 117.0 111.0 120.0  55.52
 Maximum  90.8   0.0  12.9   3.5   0.8 38.0 10174.0 5901.0 374.0 362.0 375.0  56.41
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
Summary:
CPU:  55.96 Watts on average with standard deviation 0.20
Note: power read from RAPL domains: package-0, uncore, package-0, core, psys.
These readings do not cover all the hardware in this device.

memtest86+ I ran Memtest86+ v6.00b3. It shows something like this:
Memtest86+ v6.00b3        12th Gen Intel(R) Core(TM) i5-1240P
CLK/Temp: 2112MHz    78/78 C   Pass  2% #
L1 Cache:   48KB    414 GB/s   Test 46% ##################
L2 Cache: 1.25MB    118 GB/s   Test #3 [Moving inversions, 1s & 0s] 
L3 Cache:   12MB     43 GB/s   Testing: 16GB - 18GB [1GB of 15.7GB]
Memory  :  15.7GB  14.9 GB/s   Pattern: 
--------------------------------------------------------------------------------
CPU: 4P+8E-Cores (16T)    SMP: 8T (PAR))    Time:  0:27:23  Status: Pass     \
RAM: 1600MHz (DDR4-3200) CAS 22-22-22-51    Pass:  1        Errors: 0
--------------------------------------------------------------------------------
Memory SPD Information
----------------------
 - Slot 2: 16GB DDR-4-3200 - Crucial CT16G4SFRA32A.C16FP (2022-W23)
                          Framework FRANMACP04
 <ESC> Exit  <F1> Configuration  <Space> Scroll Lock            6.00.unknown.x64
So about 30 minutes for a full 16GB memory test.

Software setup Once I had everything in the hardware setup, I figured, voil , I'm done, I'm just going to boot this beautiful machine and I can get back to work. I don't understand why I am so na ve some times. It's mind boggling. Obviously, it didn't happen that way at all, and I spent the best of the three following days tinkering with the laptop.

Secure boot and EFI First, I couldn't boot off of the NVMe drive I transferred from the previous laptop (the Purism) and the BIOS was not very helpful: it was just complaining about not finding any boot device, without dropping me in the real BIOS. At first, I thought it was a problem with my NVMe drive, because it's not listed in the compatible SSD drives from upstream. But I figured out how to enter BIOS (press F2 manically, of course), which showed the NVMe drive was actually detected. It just didn't boot, because it was an old (2010!!) Debian install without EFI. So from there, I disabled secure boot, and booted a grml image to try to recover. And by "boot" I mean, I managed to get to the grml boot loader which promptly failed to load its own root file system somehow. I still have to investigate exactly what happened there, but it failed some time after the initrd load with:
Unable to find medium containing a live file system
This, it turns out, was fixed in Debian lately, so a daily GRML build will not have this problems. The upcoming 2022 release (likely 2022.10 or 2022.11) will also get the fix. I did manage to boot the development version of the Debian installer which was a surprisingly good experience: it mounted the encrypted drives and did everything pretty smoothly. It even offered me to reinstall the boot loader, but that ultimately (and correctly, as it turns out) failed because I didn't have a /boot/efi partition. At this point, I realized there was no easy way out of this, and I just proceeded to completely reinstall Debian. I had a spare NVMe drive lying around (backups FTW!) so I just swapped that in, rebooted in the Debian installer, and did a clean install. I wanted to switch to bookworm anyways, so I guess that's done too.

Storage limitations Another thing that happened during setup is that I tried to copy over the internal 2.5" SSD drive from the Purism to the Framework 1TB expansion card. There's no 2.5" slot in the new laptop, so that's pretty much the only option for storage expansion. I was tired and did something wrong. I ended up wiping the partition table on the original 2.5" drive. Oops. It might be recoverable, but just restoring the partition table didn't work either, so I'm not sure how I recover the data there. Normally, everything on my laptops and workstations is designed to be disposable, so that wasn't that big of a problem. I did manage to recover most of the data thanks to git-annex reinit, but that was a little hairy.

Bootstrapping Puppet Once I had some networking, I had to install all the packages I needed. The time I spent setting up my workstations with Puppet has finally paid off. What I actually did was to restore two critical directories:
/etc/ssh
/var/lib/puppet
So that I would keep the previous machine's identity. That way I could contact the Puppet server and install whatever was missing. I used my Puppet optimization trick to do a batch install and then I had a good base setup, although not exactly as it was before. 1700 packages were installed manually on angela before the reinstall, and not in Puppet. I did not inspect each one individually, but I did go through /etc and copied over more SSH keys, for backups and SMTP over SSH.

LVFS support It looks like there's support for the (de-facto) standard LVFS firmware update system. At least I was able to update the UEFI firmware with a simple:
apt install fwupd-amd64-signed
fwupdmgr refresh
fwupdmgr get-updates
fwupdmgr update
Nice. The 12th gen BIOS updates, currently (January 2023) beta, can be deployed through LVFS with:
fwupdmgr enable-remote lvfs-testing
echo 'DisableCapsuleUpdateOnDisk=true' >> /etc/fwupd/uefi_capsule.conf 
fwupdmgr update
Those instructions come from the beta forum post. I performed the BIOS update on 2023-01-16T16:00-0500.

Resolution tweaks The Framework laptop resolution (2256px X 1504px) is big enough to give you a pretty small font size, so welcome to the marvelous world of "scaling". The Debian wiki page has a few tricks for this.

Console This will make the console and grub fonts more readable:
cat >> /etc/default/console-setup <<EOF
FONTFACE="Terminus"
FONTSIZE=32x16
EOF
echo GRUB_GFXMODE=1024x768 >> /etc/default/grub
update-grub

Xorg Adding this to your .Xresources will make everything look much bigger:
! 1.5*96
Xft.dpi: 144
Apparently, some of this can also help:
! These might also be useful depending on your monitor and personal preference:
Xft.autohint: 0
Xft.lcdfilter:  lcddefault
Xft.hintstyle:  hintfull
Xft.hinting: 1
Xft.antialias: 1
Xft.rgba: rgb
It my experience it also makes things look a little fuzzier, which is frustrating because you have this awesome monitor but everything looks out of focus. Just bumping Xft.dpi by a 1.5 factor looks good to me. The Debian Wiki has a page on HiDPI, but it's not as good as the Arch Wiki, where the above blurb comes from. I am not using the latter because I suspect it's causing some of the "fuzziness". TODO: find the equivalent of this GNOME hack in i3? (gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']"), taken from this Framework guide

Issues

BIOS configuration The Framework BIOS has some minor issues. One issue I personally encountered is that I had disabled Quick boot and Quiet boot in the BIOS to diagnose the above boot issues. This, in turn, triggers a bug where the BIOS boot manager (F12) would just hang completely. It would also fail to boot from an external USB drive. The current fix (as of BIOS 3.03) is to re-enable both Quick boot and Quiet boot. Presumably this is something that will get fixed in a future BIOS update. Note that the following keybindings are active in the BIOS POST check:
Key Meaning
F2 Enter BIOS setup menu
F12 Enter BIOS boot manager
Delete Enter BIOS setup menu

WiFi compatibility issues I couldn't make WiFi work at first. Obviously, the default Debian installer doesn't ship with proprietary firmware (although that might change soon) so the WiFi card didn't work out of the box. But even after copying the firmware through a USB stick, I couldn't quite manage to find the right combination of ip/iw/wpa-supplicant (yes, after repeatedly copying a bunch more packages over to get those bootstrapped). (Next time I should probably try something like this post.) Thankfully, I had a little USB-C dongle with a RJ-45 jack lying around. That also required a firmware blob, but it was a single package to copy over, and with that loaded, I had network. Eventually, I did managed to make WiFi work; the problem was more on the side of "I forgot how to configure a WPA network by hand from the commandline" than anything else. NetworkManager worked fine and got WiFi working correctly. Note that this is with Debian bookworm, which has the 5.19 Linux kernel, and with the firmware-nonfree (firmware-iwlwifi, specifically) package.

Battery life I was having between about 7 hours of battery on the Purism Librem 13v4, and that's after a year or two of battery life. Now, I still have about 7 hours of battery life, which is nicer than my old ThinkPad X220 (20 minutes!) but really, it's not that good for a new generation laptop. The 12th generation Intel chipset probably improved things compared to the previous one Framework laptop, but I don't have a 11th gen Framework to compare with). (Note that those are estimates from my status bar, not wall clock measurements. They should still be comparable between the Purism and Framework, that said.) The battery life doesn't seem up to, say, Dell XPS 13, ThinkPad X1, and of course not the Apple M1, where I would expect 10+ hours of battery life out of the box. That said, I do get those kind estimates when the machine is fully charged and idle. In fact, when everything is quiet and nothing is plugged in, I get dozens of hours of battery life estimated (I've seen 25h!). So power usage fluctuates quite a bit depending on usage, which I guess is expected. Concretely, so far, light web browsing, reading emails and writing notes in Emacs (e.g. this file) takes about 8W of power:
Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
 Average   1.7   0.0   0.5  97.6   0.2  1.2 4684.9 1985.2 126.6 39.1 128.0   7.57
 GeoMean   1.4   0.0   0.4  97.6   0.1  1.2 4416.6 1734.5 111.6 27.9 113.3   7.54
  StdDev   1.0   0.2   0.2   1.2   0.0  0.5 1584.7 1058.3 82.1 44.0 80.2   0.71
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
 Minimum   0.2   0.0   0.2  94.9   0.1  1.0 2242.0  698.2 82.0 17.0 82.0   6.36
 Maximum   4.1   1.1   1.0  99.4   0.2  3.0 8687.4 4445.1 463.0 249.0 449.0   9.10
-------- ----- ----- ----- ----- ----- ---- ------ ------ ---- ---- ---- ------
Summary:
System:   7.57 Watts on average with standard deviation 0.71
Expansion cards matter a lot in the battery life (see below for a thorough discussion), my normal setup is 2xUSB-C and 1xUSB-A (yes, with an empty slot, and yes, to save power). Interestingly, playing a video in a (720p) window in a window takes up more power (10.5W) than in full screen (9.5W) but I blame that on my desktop setup (i3 + compton)... Not sure if mpv hits the VA-API, maybe not in windowed mode. Similar results with 1080p, interestingly, except the window struggles to keep up altogether. Full screen playback takes a relatively comfortable 9.5W, which means a solid 5h+ of playback, which is fine by me. Fooling around the web, small edits, youtube-dl, and I'm at around 80% battery after about an hour, with an estimated 5h left, which is a little disappointing. I had a 7h remaining estimate before I started goofing around Discourse, so I suspect the website is a pretty big battery drain, actually. I see about 10-12 W, while I was probably at half that (6-8W) just playing music with mpv in the background... In other words, it looks like editing posts in Discourse with Firefox takes a solid 4-6W of power. Amazing and gross. (When writing about abusive power usage generates more power usage, is that an heisenbug? Or schr dinbug?)

Power management Compared to the Purism Librem 13v4, the ongoing power usage seems to be slightly better. An anecdotal metric is that the Purism would take 800mA idle, while the more powerful Framework manages a little over 500mA as I'm typing this, fluctuating between 450 and 600mA. That is without any active expansion card, except the storage. Those numbers come from the output of tlp-stat -b and, unfortunately, the "ampere" unit makes it quite hard to compare those, because voltage is not necessarily the same between the two platforms.
  • TODO: review Arch Linux's tips on power saving
  • TODO: i915 driver has a lot of parameters, including some about power saving, see, again, the arch wiki, and particularly enable_fbc=1
TL:DR; power management on the laptop is an issue, but there's various tweaks you can make to improve it. Try:
  • powertop --auto-tune
  • apt install tlp && systemctl enable tlp
  • nvme.noacpi=1 mem_sleep_default=deep on the kernel command line may help with standby power usage
  • keep only USB-C expansion cards plugged in, all others suck power even when idle
  • consider upgrading the BIOS to latest beta (3.06 at the time of writing), unverified power savings
  • latest Linux kernels (6.2) promise power savings as well (unverified)
Update: also try to follow the official optimization guide. It was made for Ubuntu but will probably also work for your distribution of choice with a few tweaks. They recommend using tlpui but it's not packaged in Debian. There is, however, a Flatpak release. In my case, it resulted in the following diff to tlp.conf: tlp.patch.

Background on CPU architecture There were power problems in the 11th gen Framework laptop, according to this report from Linux After Dark, so the issues with power management on the Framework are not new. The 12th generation Intel CPU (AKA "Alder Lake") is a big-little architecture with "power-saving" and "performance" cores. There used to be performance problems introduced by the scheduler in Linux 5.16 but those were eventually fixed in 5.18, which uses Intel's hardware as an "intelligent, low-latency hardware-assisted scheduler". According to Phoronix, the 5.19 release improved the power saving, at the cost of some penalty cost. There were also patch series to make the scheduler configurable, but it doesn't look those have been merged as of 5.19. There was also a session about this at the 2022 Linux Plumbers, but they stopped short of talking more about the specific problems Linux is facing in Alder lake:
Specifically, the kernel's energy-aware scheduling heuristics don't work well on those CPUs. A number of features present there complicate the energy picture; these include SMT, Intel's "turbo boost" mode, and the CPU's internal power-management mechanisms. For many workloads, running on an ostensibly more power-hungry Pcore can be more efficient than using an Ecore. Time for discussion of the problem was lacking, though, and the session came to a close.
All this to say that the 12gen Intel line shipped with this Framework series should have better power management thanks to its power-saving cores. And Linux has had the scheduler changes to make use of this (but maybe is still having trouble). In any case, this might not be the source of power management problems on my laptop, quite the opposite. Also note that the firmware updates for various chipsets are supposed to improve things eventually. On the other hand, The Verge simply declared the whole P-series a mistake...

Attempts at improving power usage I did try to follow some of the tips in this forum post. The tricks powertop --auto-tune and tlp's PCIE_ASPM_ON_BAT=powersupersave basically did nothing: I was stuck at 10W power usage in powertop (600+mA in tlp-stat). Apparently, I should be able to reach the C8 CPU power state (or even C9, C10) in powertop, but I seem to be stock at C7. (Although I'm not sure how to read that tab in powertop: in the Core(HW) column there's only C3/C6/C7 states, and most cores are 85% in C7 or maybe C6. But the next column over does show many CPUs in C10 states... As it turns out, the graphics card actually takes up a good chunk of power unless proper power management is enabled (see below). After tweaking this, I did manage to get down to around 7W power usage in powertop. Expansion cards actually do take up power, and so does the screen, obviously. The fully-lit screen takes a solid 2-3W of power compared to the fully dimmed screen. When removing all expansion cards and making the laptop idle, I can spin it down to 4 watts power usage at the moment, and an amazing 2 watts when the screen turned off.

Caveats Abusive (10W+) power usage that I initially found could be a problem with my desktop configuration: I have this silly status bar that updates every second and probably causes redraws... The CPU certainly doesn't seem to spin down below 1GHz. Also note that this is with an actual desktop running with everything: it could very well be that some things (I'm looking at you Signal Desktop) take up unreasonable amount of power on their own (hello, 1W/electron, sheesh). Syncthing and containerd (Docker!) also seem to take a good 500mW just sitting there. Beyond my desktop configuration, this could, of course, be a Debian-specific problem; your favorite distribution might be better at power management.

Idle power usage tests Some expansion cards waste energy, even when unused. Here is a summary of the findings from the powerstat page. I also include other devices tested in this page for completeness:
Device Minimum Average Max Stdev Note
Screen, 100% 2.4W 2.6W 2.8W N/A
Screen, 1% 30mW 140mW 250mW N/A
Backlight 1 290mW ? ? ? fairly small, all things considered
Backlight 2 890mW 1.2W 3W? 460mW? geometric progression
Backlight 3 1.69W 1.5W 1.8W? 390mW? significant power use
Radios 100mW 250mW N/A N/A
USB-C N/A N/A N/A N/A negligible power drain
USB-A 10mW 10mW ? 10mW almost negligible
DisplayPort 300mW 390mW 600mW N/A not passive
HDMI 380mW 440mW 1W? 20mW not passive
1TB SSD 1.65W 1.79W 2W 12mW significant, probably higher when busy
MicroSD 1.6W 3W 6W 1.93W highest power usage, possibly even higher when busy
Ethernet 1.69W 1.64W 1.76W N/A comparable to the SSD card
So it looks like all expansion cards but the USB-C ones are active, i.e. they draw power with idle. The USB-A cards are the least concern, sucking out 10mW, pretty much within the margin of error. But both the DisplayPort and HDMI do take a few hundred miliwatts. It looks like USB-A connectors have this fundamental flaw that they necessarily draw some powers because they lack the power negotiation features of USB-C. At least according to this post:
It seems the USB A must have power going to it all the time, that the old USB 2 and 3 protocols, the USB C only provides power when there is a connection. Old versus new.
Apparently, this is a problem specific to the USB-C to USB-A adapter that ships with the Framework. Some people have actually changed their orders to all USB-C because of this problem, but I'm not sure the problem is as serious as claimed in the forums. I couldn't reproduce the "one watt" power drains suggested elsewhere, at least not repeatedly. (A previous version of this post did show such a power drain, but it was in a less controlled test environment than the series of more rigorous tests above.) The worst offenders are the storage cards: the SSD drive takes at least one watt of power and the MicroSD card seems to want to take all the way up to 6 watts of power, both just sitting there doing nothing. This confirms claims of 1.4W for the SSD (but not 5W) power usage found elsewhere. The former post has instructions on how to disable the card in software. The MicroSD card has been reported as using 2 watts, but I've seen it as high as 6 watts, which is pretty damning. The Framework team has a beta update for the DisplayPort adapter but currently only for Windows (LVFS technically possible, "under investigation"). A USB-A firmware update is also under investigation. It is therefore likely at least some of those power management issues will eventually be fixed. Note that the upcoming Ethernet card has a reported 2-8W power usage, depending on traffic. I did my own power usage tests in powerstat-wayland and they seem lower than 2W. The upcoming 6.2 Linux kernel might also improve battery usage when idle, see this Phoronix article for details, likely in early 2023.

Idle power usage tests under Wayland Update: I redid those tests under Wayland, see powerstat-wayland for details. The TL;DR: is that power consumption is either smaller or similar.

Idle power usage tests, 3.06 beta BIOS I redid the idle tests after the 3.06 beta BIOS update and ended up with this results:
Device Minimum Average Max Stdev Note
Baseline 1.96W 2.01W 2.11W 30mW 1 USB-C, screen off, backlight off, no radios
2 USB-C 1.95W 2.16W 3.69W 430mW USB-C confirmed as mostly passive...
3 USB-C 1.95W 2.16W 3.69W 430mW ... although with extra stdev
1TB SSD 3.72W 3.85W 4.62W 200mW unchanged from before upgrade
1 USB-A 1.97W 2.18W 4.02W 530mW unchanged
2 USB-A 1.97W 2.00W 2.08W 30mW unchanged
3 USB-A 1.94W 1.99W 2.03W 20mW unchanged
MicroSD w/o card 3.54W 3.58W 3.71W 40mW significant improvement! 2-3W power saving!
MicroSD w/ card 3.53W 3.72W 5.23W 370mW new measurement! increased deviation
DisplayPort 2.28W 2.31W 2.37W 20mW unchanged
1 HDMI 2.43W 2.69W 4.53W 460mW unchanged
2 HDMI 2.53W 2.59W 2.67W 30mW unchanged
External USB 3.85W 3.89W 3.94W 30mW new result
Ethernet 3.60W 3.70W 4.91W 230mW unchanged
Note that the table summary is different than the previous table: here we show the absolute numbers while the previous table was doing a confusing attempt at showing relative (to the baseline) numbers. Conclusion: the 3.06 BIOS update did not significantly change idle power usage stats except for the MicroSD card which has significantly improved. The new "external USB" test is also interesting: it shows how the provided 1TB SSD card performs (admirably) compared to existing devices. The other new result is the MicroSD card with a card which, interestingly, uses less power than the 1TB SSD drive.

Standby battery usage I wrote some quick hack to evaluate how much power is used during sleep. Apparently, this is one of the areas that should have improved since the first Framework model, let's find out. My baseline for comparison is the Purism laptop, which, in 10 minutes, went from this:
sep 28 11:19:45 angela systemd-sleep[209379]: /sys/class/power_supply/BAT/charge_now                      =   6045 [mAh]
... to this:
sep 28 11:29:47 angela systemd-sleep[209725]: /sys/class/power_supply/BAT/charge_now                      =   6037 [mAh]
That's 8mAh per 10 minutes (and 2 seconds), or 48mA, or, with this battery, about 127 hours or roughly 5 days of standby. Not bad! In comparison, here is my really old x220, before:
sep 29 22:13:54 emma systemd-sleep[176315]: /sys/class/power_supply/BAT0/energy_now                     =   5070 [mWh]
... after:
sep 29 22:23:54 emma systemd-sleep[176486]: /sys/class/power_supply/BAT0/energy_now                     =   4980 [mWh]
... which is 90 mwH in 10 minutes, or a whopping 540mA, which was possibly okay when this battery was new (62000 mAh, so about 100 hours, or about 5 days), but this battery is almost dead and has only 5210 mAh when full, so only 10 hours standby. And here is the Framework performing a similar test, before:
sep 29 22:27:04 angela systemd-sleep[4515]: /sys/class/power_supply/BAT1/charge_full                    =   3518 [mAh]
sep 29 22:27:04 angela systemd-sleep[4515]: /sys/class/power_supply/BAT1/charge_now                     =   2861 [mAh]
... after:
sep 29 22:37:08 angela systemd-sleep[4743]: /sys/class/power_supply/BAT1/charge_now                     =   2812 [mAh]
... which is 49mAh in a little over 10 minutes (and 4 seconds), or 292mA, much more than the Purism, but half of the X220. At this rate, the battery would last on standby only 12 hours!! That is pretty bad. Note that this was done with the following expansion cards:
  • 2 USB-C
  • 1 1TB SSD drive
  • 1 USB-A with a hub connected to it, with keyboard and LAN
Preliminary tests without the hub (over one minute) show that it doesn't significantly affect this power consumption (300mA). This guide also suggests booting with nvme.noacpi=1 but this still gives me about 5mAh/min (or 300mA). Adding mem_sleep_default=deep to the kernel command line does make a difference. Before:
sep 29 23:03:11 angela systemd-sleep[3699]: /sys/class/power_supply/BAT1/charge_now                     =   2544 [mAh]
... after:
sep 29 23:04:25 angela systemd-sleep[4039]: /sys/class/power_supply/BAT1/charge_now                     =   2542 [mAh]
... which is 2mAh in 74 seconds, which is 97mA, brings us to a more reasonable 36 hours, or a day and a half. It's still above the x220 power usage, and more than an order of magnitude more than the Purism laptop. It's also far from the 0.4% promised by upstream, which would be 14mA for the 3500mAh battery. It should also be noted that this "deep" sleep mode is a little more disruptive than regular sleep. As you can see by the timing, it took more than 10 seconds for the laptop to resume, which feels a little alarming as your banging the keyboard to bring it back to life. You can confirm the current sleep mode with:
# cat /sys/power/mem_sleep
s2idle [deep]
In the above, deep is selected. You can change it on the fly with:
printf s2idle > /sys/power/mem_sleep
Here's another test:
sep 30 22:25:50 angela systemd-sleep[32207]: /sys/class/power_supply/BAT1/charge_now                     =   1619 [mAh]
sep 30 22:31:30 angela systemd-sleep[32516]: /sys/class/power_supply/BAT1/charge_now                     =   1613 [mAh]
... better! 6 mAh in about 6 minutes, works out to 63.5mA, so more than two days standby. A longer test:
oct 01 09:22:56 angela systemd-sleep[62978]: /sys/class/power_supply/BAT1/charge_now                     =   3327 [mAh]
oct 01 12:47:35 angela systemd-sleep[63219]: /sys/class/power_supply/BAT1/charge_now                     =   3147 [mAh]
That's 180mAh in about 3.5h, 52mA! Now at 66h, or almost 3 days. I wasn't sure why I was seeing such fluctuations in those tests, but as it turns out, expansion card power tests show that they do significantly affect power usage, especially the SSD drive, which can take up to two full watts of power even when idle. I didn't control for expansion cards in the above tests running them with whatever card I had plugged in without paying attention so it's likely the cause of the high power usage and fluctuations. It might be possible to work around this problem by disabling USB devices before suspend. TODO. See also this post. In the meantime, I have been able to get much better suspend performance by unplugging all modules. Then I get this result:
oct 04 11:15:38 angela systemd-sleep[257571]: /sys/class/power_supply/BAT1/charge_now                     =   3203 [mAh]
oct 04 15:09:32 angela systemd-sleep[257866]: /sys/class/power_supply/BAT1/charge_now                     =   3145 [mAh]
Which is 14.8mA! Almost exactly the number promised by Framework! With a full battery, that means a 10 days suspend time. This is actually pretty good, and far beyond what I was expecting when starting down this journey. So, once the expansion cards are unplugged, suspend power usage is actually quite reasonable. More detailed standby tests are available in the standby-tests page, with a summary below. There is also some hope that the Chromebook edition specifically designed with a specification of 14 days standby time could bring some firmware improvements back down to the normal line. Some of those issues were reported upstream in April 2022, but there doesn't seem to have been any progress there since. TODO: one final solution here is suspend-then-hibernate, which Windows uses for this TODO: consider implementing the S0ix sleep states , see also troubleshooting TODO: consider https://github.com/intel/pm-graph

Standby expansion cards test results This table is a summary of the more extensive standby-tests I have performed:
Device Wattage Amperage Days Note
baseline 0.25W 16mA 9 sleep=deep nvme.noacpi=1
s2idle 0.29W 18.9mA ~7 sleep=s2idle nvme.noacpi=1
normal nvme 0.31W 20mA ~7 sleep=s2idle without nvme.noacpi=1
1 USB-C 0.23W 15mA ~10
2 USB-C 0.23W 14.9mA same as above
1 USB-A 0.75W 48.7mA 3 +500mW (!!) for the first USB-A card!
2 USB-A 1.11W 72mA 2 +360mW
3 USB-A 1.48W 96mA <2 +370mW
1TB SSD 0.49W 32mA <5 +260mW
MicroSD 0.52W 34mA ~4 +290mW
DisplayPort 0.85W 55mA <3 +620mW (!!)
1 HDMI 0.58W 38mA ~4 +250mW
2 HDMI 0.65W 42mA <4 +70mW (?)
Conclusions:
  • USB-C cards take no extra power on suspend, possibly less than empty slots, more testing required
  • USB-A cards take a lot more power on suspend (300-500mW) than on regular idle (~10mW, almost negligible)
  • 1TB SSD and MicroSD cards seem to take a reasonable amount of power (260-290mW), compared to their runtime equivalents (1-6W!)
  • DisplayPort takes a surprising lot of power (620mW), almost double its average runtime usage (390mW)
  • HDMI cards take, surprisingly, less power (250mW) in standby than the DP card (620mW)
  • and oddly, a second card adds less power usage (70mW?!) than the first, maybe a circuit is used by both?
A discussion of those results is in this forum post.

Standby expansion cards test results, 3.06 beta BIOS Framework recently (2022-11-07) announced that they will publish a firmware upgrade to address some of the USB-C issues, including power management. This could positively affect the above result, improving both standby and runtime power usage. The update came out in December 2022 and I redid my analysis with the following results:
Device Wattage Amperage Days Note
baseline 0.25W 16mA 9 no cards, same as before upgrade
1 USB-C 0.25W 16mA 9 same as before
2 USB-C 0.25W 16mA 9 same
1 USB-A 0.80W 62mA 3 +550mW!! worse than before
2 USB-A 1.12W 73mA <2 +320mW, on top of the above, bad!
Ethernet 0.62W 40mA 3-4 new result, decent
1TB SSD 0.52W 34mA 4 a bit worse than before (+2mA)
MicroSD 0.51W 22mA 4 same
DisplayPort 0.52W 34mA 4+ upgrade improved by 300mW
1 HDMI ? 38mA ? same
2 HDMI ? 45mA ? a bit worse than before (+3mA)
Normal 1.08W 70mA ~2 Ethernet, 2 USB-C, USB-A
Full results in standby-tests-306. The big takeaway for me is that the update did not improve power usage on the USB-A ports which is a big problem for my use case. There is a notable improvement on the DisplayPort power consumption which brings it more in line with the HDMI connector, but it still doesn't properly turn off on suspend either. Even worse, the USB-A ports now sometimes fails to resume after suspend, which is pretty annoying. This is a known problem that will hopefully get fixed in the final release.

Battery wear protection The BIOS has an option to limit charge to 80% to mitigate battery wear. There's a way to control the embedded controller from runtime with fw-ectool, partly documented here. The command would be:
sudo ectool fwchargelimit 80
I looked at building this myself but failed to run it. I opened a RFP in Debian so that we can ship this in Debian, and also documented my work there. Note that there is now a counter that tracks charge/discharge cycles. It's visible in tlp-stat -b, which is a nice improvement:
root@angela:/home/anarcat# tlp-stat -b
--- TLP 1.5.0 --------------------------------------------
+++ Battery Care
Plugin: generic
Supported features: none available
+++ Battery Status: BAT1
/sys/class/power_supply/BAT1/manufacturer                   = NVT
/sys/class/power_supply/BAT1/model_name                     = Framewo
/sys/class/power_supply/BAT1/cycle_count                    =      3
/sys/class/power_supply/BAT1/charge_full_design             =   3572 [mAh]
/sys/class/power_supply/BAT1/charge_full                    =   3541 [mAh]
/sys/class/power_supply/BAT1/charge_now                     =   1625 [mAh]
/sys/class/power_supply/BAT1/current_now                    =    178 [mA]
/sys/class/power_supply/BAT1/status                         = Discharging
/sys/class/power_supply/BAT1/charge_control_start_threshold = (not available)
/sys/class/power_supply/BAT1/charge_control_end_threshold   = (not available)
Charge                                                      =   45.9 [%]
Capacity                                                    =   99.1 [%]
One thing that is still missing is the charge threshold data (the (not available) above). There's been some work to make that accessible in August, stay tuned? This would also make it possible implement hysteresis support.

Ethernet expansion card The Framework ethernet expansion card is a fancy little doodle: "2.5Gbit/s and 10/100/1000Mbit/s Ethernet", the "clear housing lets you peek at the RTL8156 controller that powers it". Which is another way to say "we didn't completely finish prod on this one, so it kind of looks like we 3D-printed this in the shop".... The card is a little bulky, but I guess that's inevitable considering the RJ-45 form factor when compared to the thin Framework laptop. I have had a serious issue when trying it at first: the link LEDs just wouldn't come up. I made a full bug report in the forum and with upstream support, but eventually figured it out on my own. It's (of course) a power saving issue: if you reboot the machine, the links come up when the laptop is running the BIOS POST check and even when the Linux kernel boots. I first thought that the problem is likely related to the powertop service which I run at boot time to tweak some power saving settings. It seems like this:
echo 'on' > '/sys/bus/usb/devices/4-2/power/control'
... is a good workaround to bring the card back online. You can even return to power saving mode and the card will still work:
echo 'auto' > '/sys/bus/usb/devices/4-2/power/control'
Further research by Matt_Hartley from the Framework Team found this issue in the tlp tracker that shows how the USB_AUTOSUSPEND setting enables the power saving even if the driver doesn't support it, which, in retrospect, just sounds like a bad idea. To quote that issue:
By default, USB power saving is active in the kernel, but not force-enabled for incompatible drivers. That is, devices that support suspension will suspend, drivers that do not, will not.
So the fix is actually to uninstall tlp or disable that setting by adding this to /etc/tlp.conf:
USB_AUTOSUSPEND=0
... but that disables auto-suspend on all USB devices, which may hurt other power usage performance. I have found that a a combination of:
USB_AUTOSUSPEND=1
USB_DENYLIST="0bda:8156"
and this on the kernel commandline:
usbcore.quirks=0bda:8156:k
... actually does work correctly. I now have this in my /etc/default/grub.d/framework-tweaks.cfg file:
# net.ifnames=0: normal interface names ffs (e.g. eth0, wlan0, not wlp166
s0)
# nvme.noacpi=1: reduce SSD disk power usage (not working)
# mem_sleep_default=deep: reduce power usage during sleep (not working)
# usbcore.quirk is a workaround for the ethernet card suspend bug: https:
//guides.frame.work/Guide/Fedora+37+Installation+on+the+Framework+Laptop/
108?lang=en
GRUB_CMDLINE_LINUX="net.ifnames=0 nvme.noacpi=1 mem_sleep_default=deep usbcore.quirks=0bda:8156:k"
# fix the resolution in grub for fonts to not be tiny
GRUB_GFXMODE=1024x768
Other than that, I haven't been able to max out the card because I don't have other 2.5Gbit/s equipment at home, which is strangely satisfying. But running against my Turris Omnia router, I could pretty much max a gigabit fairly easily:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.09 GBytes   937 Mbits/sec  238             sender
[  5]   0.00-10.00  sec  1.09 GBytes   934 Mbits/sec                  receiver
The card doesn't require any proprietary firmware blobs which is surprising. Other than the power saving issues, it just works. In my power tests (see powerstat-wayland), the Ethernet card seems to use about 1.6W of power idle, without link, in the above "quirky" configuration where the card is functional but without autosuspend.

Proprietary firmware blobs The framework does need proprietary firmware to operate. Specifically:
  • the WiFi network card shipped with the DIY kit is a AX210 card that requires a 5.19 kernel or later, and the firmware-iwlwifi non-free firmware package
  • the Bluetooth adapter also loads the firmware-iwlwifi package (untested)
  • the graphics work out of the box without firmware, but certain power management features come only with special proprietary firmware, normally shipped in the firmware-misc-nonfree but currently missing from the package
Note that, at the time of writing, the latest i915 firmware from linux-firmware has a serious bug where loading all the accessible firmware results in noticeable I estimate 200-500ms lag between the keyboard (not the mouse!) and the display. Symptoms also include tearing and shearing of windows, it's pretty nasty. One workaround is to delete the two affected firmware files:
cd /lib/firmware && rm adlp_guc_70.1.1.bin adlp_guc_69.0.3.bin
update-initramfs -u
You will get the following warning during build, which is good as it means the problematic firmware is disabled:
W: Possible missing firmware /lib/firmware/i915/adlp_guc_69.0.3.bin for module i915
W: Possible missing firmware /lib/firmware/i915/adlp_guc_70.1.1.bin for module i915
But then it also means that critical firmware isn't loaded, which means, among other things, a higher battery drain. I was able to move from 8.5-10W down to the 7W range after making the firmware work properly. This is also after turning the backlight all the way down, as that takes a solid 2-3W in full blast. The proper fix is to use some compositing manager. I ended up using compton with the following systemd unit:
[Unit]
Description=start compositing manager
PartOf=graphical-session.target
ConditionHost=angela
[Service]
Type=exec
ExecStart=compton --show-all-xerrors --backend glx --vsync opengl-swc
Restart=on-failure
[Install]
RequiredBy=graphical-session.target
compton is orphaned however, so you might be tempted to use picom instead, but in my experience the latter uses much more power (1-2W extra, similar experience). I also tried compiz but it would just crash with:
anarcat@angela:~$ compiz --replace
compiz (core) - Warn: No XI2 extension
compiz (core) - Error: Another composite manager is already running on screen: 0
compiz (core) - Fatal: No manageable screens found on display :0
When running from the base session, I would get this instead:
compiz (core) - Warn: No XI2 extension
compiz (core) - Error: Couldn't load plugin 'ccp'
compiz (core) - Error: Couldn't load plugin 'ccp'
Thanks to EmanueleRocca for figuring all that out. See also this discussion about power management on the Framework forum. Note that Wayland environments do not require any special configuration here and actually work better, see my Wayland migration notes for details.
Also note that the iwlwifi firmware also looks incomplete. Even with the package installed, I get those errors in dmesg:
[   19.534429] Intel(R) Wireless WiFi driver for Linux
[   19.534691] iwlwifi 0000:a6:00.0: enabling device (0000 -> 0002)
[   19.541867] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-72.ucode (-2)
[   19.541881] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-72.ucode (-2)
[   19.541882] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-72.ucode failed with error -2
[   19.541890] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-71.ucode (-2)
[   19.541895] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-71.ucode (-2)
[   19.541896] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-71.ucode failed with error -2
[   19.541903] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-70.ucode (-2)
[   19.541907] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-70.ucode (-2)
[   19.541908] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-70.ucode failed with error -2
[   19.541913] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-69.ucode (-2)
[   19.541916] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-69.ucode (-2)
[   19.541917] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-69.ucode failed with error -2
[   19.541922] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-68.ucode (-2)
[   19.541926] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-68.ucode (-2)
[   19.541927] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-68.ucode failed with error -2
[   19.541933] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-67.ucode (-2)
[   19.541937] iwlwifi 0000:a6:00.0: firmware: failed to load iwlwifi-ty-a0-gf-a0-67.ucode (-2)
[   19.541937] iwlwifi 0000:a6:00.0: Direct firmware load for iwlwifi-ty-a0-gf-a0-67.ucode failed with error -2
[   19.544244] iwlwifi 0000:a6:00.0: firmware: direct-loading firmware iwlwifi-ty-a0-gf-a0-66.ucode
[   19.544257] iwlwifi 0000:a6:00.0: api flags index 2 larger than supported by driver
[   19.544270] iwlwifi 0000:a6:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 0.63.2.1
[   19.544523] iwlwifi 0000:a6:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[   19.544528] iwlwifi 0000:a6:00.0: firmware: failed to load iwl-debug-yoyo.bin (-2)
[   19.544530] iwlwifi 0000:a6:00.0: loaded firmware version 66.55c64978.0 ty-a0-gf-a0-66.ucode op_mode iwlmvm
Some of those are available in the latest upstream firmware package (iwlwifi-ty-a0-gf-a0-71.ucode, -68, and -67), but not all (e.g. iwlwifi-ty-a0-gf-a0-72.ucode is missing) . It's unclear what those do or don't, as the WiFi seems to work well without them. I still copied them in from the latest linux-firmware package in the hope they would help with power management, but I did not notice a change after loading them. There are also multiple knobs on the iwlwifi and iwlmvm drivers. The latter has a power_schmeme setting which defaults to 2 (balanced), setting it to 3 (low power) could improve battery usage as well, in theory. The iwlwifi driver also has power_save (defaults to disabled) and power_level (1-5, defaults to 1) settings. See also the output of modinfo iwlwifi and modinfo iwlmvm for other driver options.

Graphics acceleration After loading the latest upstream firmware and setting up a compositing manager (compton, above), I tested the classic glxgears. Running in a window gives me odd results, as the gears basically grind to a halt:
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
137 frames in 5.1 seconds = 26.984 FPS
27 frames in 5.4 seconds =  5.022 FPS
Ouch. 5FPS! But interestingly, once the window is in full screen, it does hit the monitor refresh rate:
300 frames in 5.0 seconds = 60.000 FPS
I'm not really a gamer and I'm not normally using any of that fancy graphics acceleration stuff (except maybe my browser does?). I installed intel-gpu-tools for the intel_gpu_top command to confirm the GPU was engaged when doing those simulations. A nice find. Other useful diagnostic tools include glxgears and glxinfo (in mesa-utils) and (vainfo in vainfo). Following to this post, I also made sure to have those settings in my about:config in Firefox, or, in user.js:
user_pref("media.ffmpeg.vaapi.enabled", true);
Note that the guide suggests many other settings to tweak, but those might actually be overkill, see this comment and its parents. I did try forcing hardware acceleration by setting gfx.webrender.all to true, but everything became choppy and weird. The guide also mentions installing the intel-media-driver package, but I could not find that in Debian. The Arch wiki has, as usual, an excellent reference on hardware acceleration in Firefox.

Chromium / Signal desktop bugs It looks like both Chromium and Signal Desktop misbehave with my compositor setup (compton + i3). The fix is to add a persistent flag to Chromium. In Arch, it's conveniently in ~/.config/chromium-flags.conf but that doesn't actually work in Debian. I had to put the flag in /etc/chromium.d/disable-compositing, like this:
export CHROMIUM_FLAGS="$CHROMIUM_FLAGS --disable-gpu-compositing"
It's possible another one of the hundreds of flags might fix this issue better, but I don't really have time to go through this entire, incomplete, and unofficial list (!?!). Signal Desktop is a similar problem, and doesn't reuse those flags (because of course it doesn't). Instead I had to rewrite the wrapper script in /usr/local/bin/signal-desktop to use this instead:
exec /usr/bin/flatpak run --branch=stable --arch=x86_64 org.signal.Signal --disable-gpu-compositing "$@"
This was mostly done in this Puppet commit. I haven't figured out the root of this problem. I did try using picom and xcompmgr; they both suffer from the same issue. Another Debian testing user on Wayland told me they haven't seen this problem, so hopefully this can be fixed by switching to wayland.

Graphics card hangs I believe I might have this bug which results in a total graphical hang for 15-30 seconds. It's fairly rare so it's not too disruptive, but when it does happen, it's pretty alarming. The comments on that bug report are encouraging though: it seems this is a bug in either mesa or the Intel graphics driver, which means many people have this problem so it's likely to be fixed. There's actually a merge request on mesa already (2022-12-29). It could also be that bug because the error message I get is actually:
Jan 20 12:49:10 angela kernel: Asynchronous wait on fence 0000:00:02.0:sway[104431]:cb0ae timed out (hint:intel_atomic_commit_ready [i915]) 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] GuC firmware i915/adlp_guc_70.1.1.bin version 70.1 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] HuC firmware i915/tgl_huc_7.9.3.bin version 7.9 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] HuC authenticated 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] GuC submission enabled 
Jan 20 12:49:15 angela kernel: i915 0000:00:02.0: [drm] GuC SLPC enabled
It's a solid 30 seconds graphical hang. Maybe the keyboard and everything else keeps working. The latter bug report is quite long, with many comments, but this one from January 2023 seems to say that Sway 1.8 fixed the problem. There's also an earlier patch to add an extra kernel parameter that supposedly fixes that too. There's all sorts of other workarounds in there, for example this:
echo "options i915 enable_dc=1 enable_guc_loading=1 enable_guc_submission=1 edp_vswing=0 enable_guc=2 enable_fbc=1 enable_psr=1 disable_power_well=0"   sudo tee /etc/modprobe.d/i915.conf
from this comment... So that one is unsolved, as far as the upstream drivers are concerned, but maybe could be fixed through Sway.

Weird USB hangs / graphical glitches I have had weird connectivity glitches better described in this post, but basically: my USB keyboard and mice (connected over a USB hub) drop keys, lag a lot or hang, and I get visual glitches. The fix was to tighten the screws around the CPU on the motherboard (!), which is, thankfully, a rather simple repair.

USB docks are hell Note that the monitors are hooked up to angela through a USB-C / Thunderbolt dock from Cable Matters, with the lovely name of 201053-SIL. It has issues, see this blog post for an in-depth discussion.

Shipping details I ordered the Framework in August 2022 and received it about a month later, which is sooner than expected because the August batch was late. People (including me) expected this to have an impact on the September batch, but it seems Framework have been able to fix the delivery problems and keep up with the demand. As of early 2023, their website announces that laptops ship "within 5 days". I have myself ordered a few expansion cards in November 2022, and they shipped on the same day, arriving 3-4 days later.

The supply pipeline There are basically 6 steps in the Framework shipping pipeline, each (except the last) accompanied with an email notification:
  1. pre-order
  2. preparing batch
  3. preparing order
  4. payment complete
  5. shipping
  6. (received)
This comes from the crowdsourced spreadsheet, which should be updated when the status changes here. I was part of the "third batch" of the 12th generation laptop, which was supposed to ship in September. It ended up arriving on my door step on September 27th, about 33 days after ordering. It seems current orders are not processed in "batches", but in real time, see this blog post for details on shipping.

Shipping trivia I don't know about the others, but my laptop shipped through no less than four different airplane flights. Here are the hops it took: I can't quite figure out how to calculate exactly how much mileage that is, but it's huge. The ride through Alaska is surprising enough but the bounce back through Winnipeg is especially weird. I guess the route happens that way because of Fedex shipping hubs. There was a related oddity when I had my Purism laptop shipped: it left from the west coast and seemed to enter on an endless, two week long road trip across the continental US.

Other resources

6 February 2023

Reproducible Builds: Reproducible Builds in January 2023

Welcome to the first report for 2023 from the Reproducible Builds project! In these reports we try and outline the most important things that we have been up to over the past month, as well as the most important things in/around the community. As a quick recap, the motivation behind the reproducible builds effort is to ensure no malicious flaws can be deliberately introduced during compilation and distribution of the software that we run on our devices. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.


News In a curious turn of events, GitHub first announced this month that the checksums of various Git archives may be subject to change, specifically that because:
the default compression for Git archives has recently changed. As result, archives downloaded from GitHub may have different checksums even though the contents are completely unchanged.
This change (which was brought up on our mailing list last October) would have had quite wide-ranging implications for anyone wishing to validate and verify downloaded archives using cryptographic signatures. However, GitHub reversed this decision, updating their original announcement with a message that We are reverting this change for now. More details to follow. It appears that this was informed in part by an in-depth discussion in the GitHub Community issue tracker.
The Bundesamt f r Sicherheit in der Informationstechnik (BSI) (trans: The Federal Office for Information Security ) is the agency in charge of managing computer and communication security for the German federal government. They recently produced a report that touches on attacks on software supply-chains (Supply-Chain-Angriff). (German PDF)
Contributor Seb35 updated our website to fix broken links to Tails Git repository [ ][ ], and Holger updated a large number of pages around our recent summit in Venice [ ][ ][ ][ ].
Noak J nsson has written an interesting paper entitled The State of Software Diversity in the Software Supply Chain of Ethereum Clients. As the paper outlines:
In this report, the software supply chains of the most popular Ethereum clients are cataloged and analyzed. The dependency graphs of Ethereum clients developed in Go, Rust, and Java, are studied. These client are Geth, Prysm, OpenEthereum, Lighthouse, Besu, and Teku. To do so, their dependency graphs are transformed into a unified format. Quantitative metrics are used to depict the software supply chain of the blockchain. The results show a clear difference in the size of the software supply chain required for the execution layer and consensus layer of Ethereum.

Yongkui Han posted to our mailing list discussing making reproducible builds & GitBOM work together without gitBOM-ID embedding. GitBOM (now renamed to OmniBOR) is a project to enable automatic, verifiable artifact resolution across today s diverse software supply-chains [ ]. In addition, Fabian Keil wrote to us asking whether anyone in the community would be at Chemnitz Linux Days 2023, which is due to take place on 11th and 12th March (event info). Separate to this, Akihiro Suda posted to our mailing list just after the end of the month with a status report of bit-for-bit reproducible Docker/OCI images. As Akihiro mentions in their post, they will be giving a talk at FOSDEM in the Containers devroom titled Bit-for-bit reproducible builds with Dockerfile and that my talk will also mention how to pin the apt/dnf/apk/pacman packages with my repro-get tool.
The extremely popular Signal messenger app added upstream support for the SOURCE_DATE_EPOCH environment variable this month. This means that release tarballs of the Signal desktop client do not embed nondeterministic release information. [ ][ ]

Distribution work

F-Droid & Android There was a very large number of changes in the F-Droid and wider Android ecosystem this month: On January 15th, a blog post entitled Towards a reproducible F-Droid was published on the F-Droid website, outlining the reasons why F-Droid signs published APKs with its own keys and how reproducible builds allow using upstream developers keys instead. In particular:
In response to [ ] criticisms, we started encouraging new apps to enable reproducible builds. It turns out that reproducible builds are not so difficult to achieve for many apps. In the past few months we ve gotten many more reproducible apps in F-Droid than before. Currently we can t highlight which apps are reproducible in the client, so maybe you haven t noticed that there are many new apps signed with upstream developers keys.
(There was a discussion about this post on Hacker News.) In addition:
  • F-Droid added 13 apps published with reproducible builds this month. [ ]
  • FC Stegerman outlined a bug where baseline.profm files are nondeterministic, developed a workaround, and provided all the details required for a fix. As they note, this issue has now been fixed but the fix is not yet part of an official Android Gradle plugin release.
  • GitLab user Parwor discovered that the number of CPU cores can affect the reproducibility of .dex files. [ ]
  • FC Stegerman also announced the 0.2.0 and 0.2.1 releases of reproducible-apk-tools, a suite of tools to help make .apk files reproducible. Several new subcommands and scripts were added, and a number of bugs were fixed as well [ ][ ]. They also updated the F-Droid website to improve the reproducibility-related documentation. [ ][ ]
  • On the F-Droid issue tracker, FC Stegerman discussed reproducible builds with one of the developers of the Threema messenger app and reported that Android SDK build-tools 31.0.0 and 32.0.0 (unlike earlier and later versions) have a zipalign command that produces incorrect padding.
  • A number of bugs related to reproducibility were discovered in Android itself. Firstly, the non-deterministic order of .zip entries in .apk files [ ] and then newline differences between building on Windows versus Linux that can make builds not reproducible as well. [ ] (Note that these links may require a Google account to view.)
  • And just before the end of the month, FC Stegerman started a thread on our mailing list on the topic of hiding data/code in APK embedded signatures which has been made possible by the Android APK Signature Scheme v2/v3. As part of this, they made an Android app that reads the APK Signing block of its own APK and extracts a payload in order to alter its behaviour called sigblock-code-poc.

Debian As mentioned in last month s report, Vagrant Cascadian has been organising a series of online sprints in order to clear the huge backlog of reproducible builds patches submitted by performing NMUs (Non-Maintainer Uploads). During January, a sprint took place on the 10th, resulting in the following uploads: During this sprint, Holger Levsen filed Debian bug #1028615 to request that the tracker.debian.org service display results of reproducible rebuilds, not just reproducible CI results. Elsewhere in Debian, strip-nondeterminism is our tool to remove specific non-deterministic results from a completed build. This month, version 1.13.1-1 was uploaded to Debian unstable by Holger Levsen, including a fix by FC Stegerman (obfusk) to update a regular expression for the latest version of file(1) [ ]. (#1028892) Lastly, 65 reviews of Debian packages were added, 21 were updated and 35 were removed this month adding to our knowledge about identified issues.

Other distributions In other distributions:

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb made the following changes to diffoscope, including preparing and uploading versions 231, 232, 233 and 234 to Debian:
  • No need for from __future__ import print_function import anymore. [ ]
  • Comment and tidy the extras_require.json handling. [ ]
  • Split inline Python code to generate test Recommends into a separate Python script. [ ]
  • Update debian/tests/control after merging support for PyPDF support. [ ]
  • Correctly catch segfaulting cd-iccdump binary. [ ]
  • Drop some old debugging code. [ ]
  • Allow ICC tests to (temporarily) fail. [ ]
In addition, FC Stegerman (obfusk) made a number of changes, including:
  • Updating the test_text_proper_indentation test to support the latest version(s) of file(1). [ ]
  • Use an extras_require.json file to store some build/release metadata, instead of accessing the internet. [ ]
  • Updating an APK-related file(1) regular expression. [ ]
  • On the diffoscope.org website, de-duplicate contributors by e-mail. [ ]
Lastly, Sam James added support for PyPDF version 3 [ ] and Vagrant Cascadian updated a handful of tool references for GNU Guix. [ ][ ]

Upstream patches The Reproducible Builds project attempts to fix as many currently-unreproducible packages as possible. This month, we wrote a large number of such patches, including:

Testing framework The Reproducible Builds project operates a comprehensive testing framework at tests.reproducible-builds.org in order to check packages and other artifacts for reproducibility. In January, the following changes were made by Holger Levsen:
  • Node changes:
  • Debian-related changes:
    • Only keep diffoscope s HTML output (ie. no .json or .txt) for LTS suites and older in order to save diskspace on the Jenkins host. [ ]
    • Re-create pbuilder base less frequently for the stretch, bookworm and experimental suites. [ ]
  • OpenWrt-related changes:
    • Add gcc-multilib to OPENWRT_HOST_PACKAGES and install it on the nodes that need it. [ ]
    • Detect more problems in the health check when failing to build OpenWrt. [ ]
  • Misc changes:
    • Update the chroot-run script to correctly manage /dev and /dev/pts. [ ][ ][ ]
    • Update the Jenkins shell monitor script to collect disk stats less frequently [ ] and to include various directory stats. [ ][ ]
    • Update the real year in the configuration in order to be able to detect whether a node is running in the future or not. [ ]
    • Bump copyright years in the default page footer. [ ]
In addition, Christian Marangi submitted a patch to build OpenWrt packages with the V=s flag to enable debugging. [ ]
If you are interested in contributing to the Reproducible Builds project, please visit the Contribute page on our website. You can get in touch with us via:

30 December 2022

Utkarsh Gupta: FOSS Activites in December 2022

Here s my (thirty-ninth) monthly but brief update about the activities I ve done in the F/L/OSS world.

Debian
This was my 48th month of actively contributing to Debian. I became a DM in late March 2019 and a DD on Christmas 19! \o/ There s a bunch of things I do, both, technical and non-technical. Here are the things I did this month:
  • Some DebConf work.
  • Sponsoring stuff for non-DDs.
  • Mentoring for newcomers.
  • Moderation of -project mailing list.

Ubuntu
This was my 23rd month of actively contributing to Ubuntu. Now that I joined Canonical to work on Ubuntu full-time, there s a bunch of things I do! \o/ I mostly worked on different things, I guess. I was too lazy to maintain a list of things I worked on so there s no concrete list atm. Maybe I ll get back to this section later or will start to list stuff from the fall, as I was doing before. :D

Debian (E)LTS
Debian Long Term Support (LTS) is a project to extend the lifetime of all Debian stable releases to (at least) 5 years. Debian LTS is not handled by the Debian security team, but by a separate group of volunteers and companies interested in making it a success. And Debian Extended LTS (ELTS) is its sister project, extending support to the stretch and jessie release (+2 years after LTS support). This was my thirty-ninth month as a Debian LTS and thirtieth month as a Debian ELTS paid contributor.
I worked for 51.50 hours for LTS and 22.50 hours for ELTS.

LTS CVE Fixes and Announcements:

ELTS CVE Fixes and Announcements:
  • Issued ELA 752-1, fixing CVE-2021-41182, CVE-2021-41183, CVE-2021-41184, and CVE-2022-31160, for jqueryui.
    For Debian 9 stretch, these problems have been fixed in version 1.12.1+dfsg-4+deb9u1.
  • Helped facilitate Erlang s and RabbitMQ s update; cf: ELA 754-1.
  • Looked through python3.4 s FTBFS on armhf. Even diff d with Ubuntu. No luck. Inspected the traces, doesn t give a lot of hint either. Will continue to look later next month or so but it s a rabbit hole. (:
  • Inspected joblib s security update upon Helmut s investigation and see what went wrong there.
  • Started to look at other set of packages: dropbear, tiff, et al.

Other (E)LTS Work:
Until next time.
:wq for today.

15 December 2022

Reproducible Builds: Supporter spotlight: David A. Wheeler on supply chain security

The Reproducible Builds project relies on several projects, supporters and sponsors for financial support, but they are also valued as ambassadors who spread the word about our project and the work that we do. This is the sixth instalment in a series featuring the projects, companies and individuals who support the Reproducible Builds project. We started this series by featuring the Civil Infrastructure Platform project and followed this up with a post about the Ford Foundation as well as a recent ones about ARDC, the Google Open Source Security Team (GOSST), Jan Nieuwenhuizen on Bootstrappable Builds, GNU Mes and GNU Guix and Hans-Christoph Steiner of the F-Droid project. Today, however, we will be talking with David A. Wheeler, the Director of Open Source Supply Chain Security at the Linux Foundation.

Holger Levsen: Welcome, David, thanks for taking the time to talk with us today. First, could you briefly tell me about yourself? David: Sure! I m David A. Wheeler and I work for the Linux Foundation as the Director of Open Source Supply Chain Security. That just means that my job is to help open source software projects improve their security, including its development, build, distribution, and incorporation in larger works, all the way out to its eventual use by end-users. In my copious free time I also teach at George Mason University (GMU); in particular, I teach a graduate course on how to design and implement secure software. My background is technical. I have a Bachelor s in Electronics Engineering, a Master s in Computer Science and a PhD in Information Technology. My PhD dissertation is connected to reproducible builds. My PhD dissertation was on countering the Trusting Trust attack, an attack that subverts fundamental build system tools such as compilers. The attack was discovered by Karger & Schell in the 1970s, and later demonstrated & popularized by Ken Thompson. In my dissertation on trusting trust I showed that a process called Diverse Double-Compiling (DDC) could detect trusting trust attacks. That process is a specialized kind of reproducible build specifically designed to detect trusting trust style attacks. In addition, countering the trusting trust attack primarily becomes more important only when reproducible builds become more common. Reproducible builds enable detection of build-time subversions. Most attackers wouldn t bother with a trusting trust attack if they could just directly use a build-time subversion of the software they actually want to subvert.
Holger: Thanks for taking the time to introduce yourself to us. What do you think are the biggest challenges today in computing? There are many big challenges in computing today. For example:
Holger: Do you think reproducible builds are an important part in secure computing today already? David: Yes, but first let s put things in context. Today, when attackers exploit software vulnerabilities, they re primarily exploiting unintentional vulnerabilities that were created by the software developers. There are a lot of efforts to counter this: We re just starting to get better at this, which is good. However, attackers always try to attack the easiest target. As our deployed software has started to be hardened against attack, attackers have dramatically increased their attacks on the software supply chain (Sonatype found in 2022 that there s been a 742% increase year-over-year). The software supply chain hasn t historically gotten much attention, making it the easy target. There are simple supply chain attacks with simple solutions: Unfortunately, attackers know there are other lines of attack. One of the most dangerous is subverted build systems, as demonstrated by the subversion of SolarWinds Orion system. In a subverted build system, developers can review the software source code all day and see no problem, because there is no problem there. Instead, the process to convert source code into the code people run, called the build system , is subverted by an attacker. One solution for countering subverted build systems is to make the build systems harder to attack. That s a good thing to do, but you can never be confident that it was good enough . How can you be sure it s not subverted, if there s no way to know? A stronger defense against subverted build systems is the idea of verified reproducible builds. A build is reproducible if given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts. A build is verified if multiple different parties verify that they get the same result for that situation. When you have a verified reproducible build, either all the parties colluded (and you could always double-check it yourself), or the build process isn t subverted. There is one last turtle: What if the build system tools or machines are subverted themselves? This is not a common attack today, but it s important to know if we can address them when the time comes. The good news is that we can address this. For some situations reproducible builds can also counter such attacks. If there s a loop (that is, a compiler is used to generate itself), that s called the trusting trust attack, and that is more challenging. Thankfully, the trusting trust attack has been known about for decades and there are known solutions. The diverse double-compiling (DDC) process that I explained in my PhD dissertation, as well as the bootstrappable builds process, can both counter trusting trust attacks in the software space. So there is no reason to lose hope: there is a bottom turtle , as it were.
Holger: Thankfully, this has all slowly started to change and supply chain issues are now widely discussed, as evident by efforts like Securing the Software Supply Chain: Recommended Practices Guide for Developers which you shared on our mailing list. In there, Reproducible Builds are mentioned as recommended advanced practice, which is both pretty cool (we ve come a long way!), but to me it also sounds like this will take another decade until it s become standard normal procedure. Do you agree on that timeline? David: I don t think there will be any particular timeframe. Different projects and ecosystems will move at different speeds. I wouldn t be surprised if it took a decade or so for them to become relatively common there are good reasons for that. Today the most common kinds of attacks based on software vulnerabilities still involve unintentional vulnerabilities in operational systems. Attackers are starting to apply supply chain attacks, but the top such attacks today are typosquatting (creating packages with similar names) and dependency confusion) (convincing projects to download packages from the wrong repositories). Reproducible builds don t counter those kinds of attacks, they counter subverted builds. It s important to eventually have verified reproducible builds, but understandably other issues are currently getting prioritized first. That said, reproducible builds are important long term. Many people are working on countering unintentional vulnerabilities and the most common kinds of supply chain attacks. As these other threats are countered, attackers will increasingly target build systems. Attackers always go for the weakest link. We will eventually need verified reproducible builds in many situations, and it ll take a while to get build systems able to widely perform reproducible builds, so we need to start that work now. That s true for anything where you know you ll need it but it will take a long time to get ready you need to start now.
Holger: What are your suggestions to accelerate adoption? David: Reproducible builds need to be: I think there s a snowball effect. Once many projects packages are reproducible, it will be easier to convince other projects to make their packages reproducible. I also think there should be some prioritization. If a package is in wide use (e.g., part of minimum set of packages for a widely-used Linux distribution or framework), its reproducibility should be a special focus. If a package is vital for supporting some societally important critical infrastructure (e.g., running dams), it should also be considered important. You can then work on the ones that are less important over time.
Holger: How is the Best Practices Badge going? How many projects are participating and how many are missing? David: It s going very well. You can see some automatically-generated statistics, showing we have over 5,000 projects, adding more than 1/day on average. We have more than 900 projects that have earned at least the passing badge level.
Holger: How many of the projects participating in the Best Practices badge engaging with reproducible builds? David: As of this writing there are 168 projects that report meeting the reproducible builds criterion. That s a relatively small percentage of projects. However, note that this criterion (labelled build_reproducible) is only required for the gold badge. It s not required for the passing or silver level badge. Currently we ve been strategically focused on getting projects to at least earn a passing badge, and less on earning silver or gold badges. We would love for all projects to get earn a silver or gold badge, of course, but our theory is that projects that can t even earn a passing badge present the most risk to their users. That said, there are some projects we especially want to see implementing higher badge levels. Those include projects that are very widely used, so that vulnerabilities in them can impact many systems. Examples of such projects include the Linux kernel and curl. In addition, some projects are used within systems where it s important to society that they not have serious security vulnerabilities. Examples include projects used by chemical manufacturers, financial systems and weapons. We definitely encourage any of those kinds of projects to earn higher badge levels.
Holger: Many thanks for this interview, David, and for all of your work at the Linux Foundation and elsewhere!




For more information about the Reproducible Builds project, please see our website at reproducible-builds.org. If you are interested in ensuring the ongoing security of the software that underpins our civilisation and wish to sponsor the Reproducible Builds project, please reach out to the project by emailing contact@reproducible-builds.org.

16 November 2022

Antoine Beaupr : A ZFS migration

In my tubman setup, I started using ZFS on an old server I had lying around. The machine is really old though (2011!) and it "feels" pretty slow. I want to see how much of that is ZFS and how much is the machine. Synthetic benchmarks show that ZFS may be slower than mdadm in RAID-10 or RAID-6 configuration, so I want to confirm that on a live workload: my workstation. Plus, I want easy, regular, high performance backups (with send/receive snapshots) and there's no way I'm going to use BTRFS because I find it too confusing and unreliable. So off we go.

Installation Since this is a conversion (and not a new install), our procedure is slightly different than the official documentation but otherwise it's pretty much in the same spirit: we're going to use ZFS for everything, including the root filesystem. So, install the required packages, on the current system:
apt install --yes gdisk zfs-dkms zfs zfs-initramfs zfsutils-linux
We also tell DKMS that we need to rebuild the initrd when upgrading:
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf

Partitioning This is going to partition /dev/sdc with:
  • 1MB MBR / BIOS legacy boot
  • 512MB EFI boot
  • 1GB bpool, unencrypted pool for /boot
  • rest of the disk for zpool, the rest of the data
     sgdisk --zap-all /dev/sdc
     sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sdc
     sgdisk     -n2:1M:+512M   -t2:EF00 /dev/sdc
     sgdisk     -n3:0:+1G      -t3:BF01 /dev/sdc
     sgdisk     -n4:0:0        -t4:BF00 /dev/sdc
    
That will look something like this:
    root@curie:/home/anarcat# sgdisk -p /dev/sdc
    Disk /dev/sdc: 1953525168 sectors, 931.5 GiB
    Model: ESD-S1C         
    Sector size (logical/physical): 512/512 bytes
    Disk identifier (GUID): [REDACTED]
    Partition table holds up to 128 entries
    Main partition table begins at sector 2 and ends at sector 33
    First usable sector is 34, last usable sector is 1953525134
    Partitions will be aligned on 16-sector boundaries
    Total free space is 14 sectors (7.0 KiB)
    Number  Start (sector)    End (sector)  Size       Code  Name
       1              48            2047   1000.0 KiB  EF02  
       2            2048         1050623   512.0 MiB   EF00  
       3         1050624         3147775   1024.0 MiB  BF01  
       4         3147776      1953525134   930.0 GiB   BF00
Unfortunately, we can't be sure of the sector size here, because the USB controller is probably lying to us about it. Normally, this smartctl command should tell us the sector size as well:
root@curie:~# smartctl -i /dev/sdb -qnoserial
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black Mobile
Device Model:     WDC WD10JPLX-00MBPT0
Firmware Version: 01.01H01
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 17 13:33:04 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Above is the example of the builtin HDD drive. But the SSD device enclosed in that USB controller doesn't support SMART commands, so we can't trust that it really has 512 bytes sectors. This matters because we need to tweak the ashift value correctly. We're going to go ahead the SSD drive has the common 4KB settings, which means ashift=12. Note here that we are not creating a separate partition for swap. Swap on ZFS volumes (AKA "swap on ZVOL") can trigger lockups and that issue is still not fixed upstream. Ubuntu recommends using a separate partition for swap instead. But since this is "just" a workstation, we're betting that we will not suffer from this problem, after hearing a report from another Debian developer running this setup on their workstation successfully. We do not recommend this setup though. In fact, if I were to redo this partition scheme, I would probably use LUKS encryption and setup a dedicated swap partition, as I had problems with ZFS encryption as well.

Creating pools ZFS pools are somewhat like "volume groups" if you are familiar with LVM, except they obviously also do things like RAID-10. (Even though LVM can technically also do RAID, people typically use mdadm instead.) In any case, the guide suggests creating two different pools here: one, in cleartext, for boot, and a separate, encrypted one, for the rest. Technically, the boot partition is required because the Grub bootloader only supports readonly ZFS pools, from what I understand. But I'm a little out of my depth here and just following the guide.

Boot pool creation This creates the boot pool in readonly mode with features that grub supports:
    zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O canmount=off \
        -O compression=lz4 \
        -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
        -O mountpoint=/boot -R /mnt \
        bpool /dev/sdc3
I haven't investigated all those settings and just trust the upstream guide on the above.

Main pool creation This is a more typical pool creation.
    zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/ -R /mnt \
        rpool /dev/sdc4
Breaking this down:
  • -o ashift=12: mentioned above, 4k sector size
  • -O encryption=on -O keylocation=prompt -O keyformat=passphrase: encryption, prompt for a password, default algorithm is aes-256-gcm, explicit in the guide, made implicit here
  • -O acltype=posixacl -O xattr=sa: enable ACLs, with better performance (not enabled by default)
  • -O dnodesize=auto: related to extended attributes, less compatibility with other implementations
  • -O compression=zstd: enable zstd compression, can be disabled/enabled by dataset to with zfs set compression=off rpool/example
  • -O relatime=on: classic atime optimisation, another that could be used on a busy server is atime=off
  • -O canmount=off: do not make the pool mount automatically with mount -a?
  • -O mountpoint=/ -R /mnt: mount pool on / in the future, but /mnt for now
Those settings are all available in zfsprops(8). Other flags are defined in zpool-create(8). The reasoning behind them is also explained in the upstream guide and some also in [the Debian wiki][]. Those flags were actually not used:
  • -O normalization=formD: normalize file names on comparisons (not storage), implies utf8only=on, which is a bad idea (and effectively meant my first sync failed to copy some files, including this folder from a supysonic checkout). and this cannot be changed after the filesystem is created. bad, bad, bad.
[the Debian wiki]: https://wiki.debian.org/ZFS#Advanced_Topics

Side note about single-disk pools Also note that we're living dangerously here: single-disk ZFS pools are rumoured to be more dangerous than not running ZFS at all. The choice quote from this article is:
[...] any error can be detected, but cannot be corrected. This sounds like an acceptable compromise, but its actually not. The reason its not is that ZFS' metadata cannot be allowed to be corrupted. If it is it is likely the zpool will be impossible to mount (and will probably crash the system once the corruption is found). So a couple of bad sectors in the right place will mean that all data on the zpool will be lost. Not some, all. Also there's no ZFS recovery tools, so you cannot recover any data on the drives.
Compared with (say) ext4, where a single disk error can recovered, this is pretty bad. But we are ready to live with this with the idea that we'll have hourly offline snapshots that we can easily recover from. It's trade-off. Also, we're running this on a NVMe/M.2 drive which typically just blinks out of existence completely, and doesn't "bit rot" the way a HDD would. Also, the FreeBSD handbook quick start doesn't have any warnings about their first example, which is with a single disk. So I am reassured at least.

Creating mount points Next we create the actual filesystems, known as "datasets" which are the things that get mounted on mountpoint and hold the actual files.
  • this creates two containers, for ROOT and BOOT
     zfs create -o canmount=off -o mountpoint=none rpool/ROOT &&
     zfs create -o canmount=off -o mountpoint=none bpool/BOOT
    
    Note that it's unclear to me why those datasets are necessary, but they seem common practice, also used in this FreeBSD example. The OpenZFS guide mentions the Solaris upgrades and Ubuntu's zsys that use that container for upgrades and rollbacks. This blog post seems to explain a bit the layout behind the installer.
  • this creates the actual boot and root filesystems:
     zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian &&
     zfs mount rpool/ROOT/debian &&
     zfs create -o mountpoint=/boot bpool/BOOT/debian
    
    I guess the debian name here is because we could technically have multiple operating systems with the same underlying datasets.
  • then the main datasets:
     zfs create                                 rpool/home &&
     zfs create -o mountpoint=/root             rpool/home/root &&
     chmod 700 /mnt/root &&
     zfs create                                 rpool/var
    
  • exclude temporary files from snapshots:
     zfs create -o com.sun:auto-snapshot=false  rpool/var/cache &&
     zfs create -o com.sun:auto-snapshot=false  rpool/var/tmp &&
     chmod 1777 /mnt/var/tmp
    
  • and skip automatic snapshots in Docker:
     zfs create -o canmount=off                 rpool/var/lib &&
     zfs create -o com.sun:auto-snapshot=false  rpool/var/lib/docker
    
    Notice here a peculiarity: we must create rpool/var/lib to create rpool/var/lib/docker otherwise we get this error:
     cannot create 'rpool/var/lib/docker': parent does not exist
    
    ... and no, just creating /mnt/var/lib doesn't fix that problem. In fact, it makes things even more confusing because an existing directory shadows a mountpoint, which is the opposite of how things normally work. Also note that you will probably need to change storage driver in Docker, see the zfs-driver documentation for details but, basically, I did:
    echo '  "storage-driver": "zfs"  ' > /etc/docker/daemon.json
    
    Note that podman has the same problem (and similar solution):
    printf '[storage]\ndriver = "zfs"\n' > /etc/containers/storage.conf
    
  • make a tmpfs for /run:
     mkdir /mnt/run &&
     mount -t tmpfs tmpfs /mnt/run &&
     mkdir /mnt/run/lock
    
We don't create a /srv, as that's the HDD stuff. Also mount the EFI partition:
mkfs.fat -F 32 /dev/sdc2 &&
mount /dev/sdc2 /mnt/boot/efi/
At this point, everything should be mounted in /mnt. It should look like this:
root@curie:~# LANG=C df -h -t zfs -t vfat
Filesystem            Size  Used Avail Use% Mounted on
rpool/ROOT/debian     899G  384K  899G   1% /mnt
bpool/BOOT/debian     832M  123M  709M  15% /mnt/boot
rpool/home            899G  256K  899G   1% /mnt/home
rpool/home/root       899G  256K  899G   1% /mnt/root
rpool/var             899G  384K  899G   1% /mnt/var
rpool/var/cache       899G  256K  899G   1% /mnt/var/cache
rpool/var/tmp         899G  256K  899G   1% /mnt/var/tmp
rpool/var/lib/docker  899G  256K  899G   1% /mnt/var/lib/docker
/dev/sdc2             511M  4.0K  511M   1% /mnt/boot/efi
Now that we have everything setup and mounted, let's copy all files over.

Copying files This is a list of all the mounted filesystems
for fs in /boot/ /boot/efi/ / /home/; do
    echo "syncing $fs to /mnt$fs..." && 
    rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
done
You can check that the list is correct with:
mount -l -t ext4,btrfs,vfat   awk ' print $3 '
Note that we skip /srv as it's on a different disk. On the first run, we had:
root@curie:~# for fs in /boot/ /boot/efi/ / /home/; do
        echo "syncing $fs to /mnt$fs..." && 
        rsync -aSHAXx --info=progress2 $fs /mnt$fs
    done
syncing /boot/ to /mnt/boot/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/299)  
syncing /boot/efi/ to /mnt/boot/efi/...
     16,831,437 100%  184.14MB/s    0:00:00 (xfr#101, to-chk=0/110)
syncing / to /mnt/...
 28,019,293,280  94%   47.63MB/s    0:09:21 (xfr#703710, ir-chk=6748/839220)rsync: [generator] delete_file: rmdir(var/lib/docker) failed: Device or resource busy (16)
could not make way for new symlink: var/lib/docker
 34,081,267,990  98%   50.71MB/s    0:10:40 (xfr#736577, to-chk=0/867732)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
syncing /home/ to /mnt/home/...
rsync: [sender] readlink_stat("/home/anarcat/.fuse") failed: Permission denied (13)
 24,456,268,098  98%   68.03MB/s    0:05:42 (xfr#159867, ir-chk=6875/172377) 
file has vanished: "/home/anarcat/.cache/mozilla/firefox/s2hwvqbu.quantum/cache2/entries/B3AB0CDA9C4454B3C1197E5A22669DF8EE849D90"
199,762,528,125  93%   74.82MB/s    0:42:26 (xfr#1437846, ir-chk=1018/1983979)rsync: [generator] recv_generator: mkdir "/mnt/home/anarcat/dist/supysonic/tests/assets/\#346" failed: Invalid or incomplete multibyte or wide character (84)
*** Skipping any contents from this failed directory ***
315,384,723,978  96%   76.82MB/s    1:05:15 (xfr#2256473, to-chk=0/2993950)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Note the failure to transfer that supysonic file? It turns out they had a weird filename in their source tree, since then removed, but still it showed how the utf8only feature might not be such a bad idea. At this point, the procedure was restarted all the way back to "Creating pools", after unmounting all ZFS filesystems (umount /mnt/run /mnt/boot/efi && umount -t zfs -a) and destroying the pool, which, surprisingly, doesn't require any confirmation (zpool destroy rpool). The second run was cleaner:
root@curie:~# for fs in /boot/ /boot/efi/ / /home/; do
        echo "syncing $fs to /mnt$fs..." && 
        rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
    done
syncing /boot/ to /mnt/boot/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/299)  
syncing /boot/efi/ to /mnt/boot/efi/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/110)  
syncing / to /mnt/...
 28,019,033,070  97%   42.03MB/s    0:10:35 (xfr#703671, ir-chk=1093/833515)rsync: [generator] delete_file: rmdir(var/lib/docker) failed: Device or resource busy (16)
could not make way for new symlink: var/lib/docker
 34,081,807,102  98%   44.84MB/s    0:12:04 (xfr#736580, to-chk=0/867723)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
syncing /home/ to /mnt/home/...
rsync: [sender] readlink_stat("/home/anarcat/.fuse") failed: Permission denied (13)
IO error encountered -- skipping file deletion
 24,043,086,450  96%   62.03MB/s    0:06:09 (xfr#151819, ir-chk=15117/172571)
file has vanished: "/home/anarcat/.cache/mozilla/firefox/s2hwvqbu.quantum/cache2/entries/4C1FDBFEA976FF924D062FB990B24B897A77B84B"
315,423,626,507  96%   67.09MB/s    1:14:43 (xfr#2256845, to-chk=0/2994364)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Also note the transfer speed: we seem capped at 76MB/s, or 608Mbit/s. This is not as fast as I was expecting: the USB connection seems to be at around 5Gbps:
anarcat@curie:~$ lsusb -tv   head -4
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
     __ Port 1: Dev 4, If 0, Class=Mass Storage, Driver=uas, 5000M
        ID 0b05:1932 ASUSTek Computer, Inc.
So it shouldn't cap at that speed. It's possible the USB adapter is failing to give me the full speed though. It's not the M.2 SSD drive either, as that has a ~500MB/s bandwidth, acccording to its spec. At this point, we're about ready to do the final configuration. We drop to single user mode and do the rest of the procedure. That used to be shutdown now, but it seems like the systemd switch broke that, so now you can reboot into grub and pick the "recovery" option. Alternatively, you might try systemctl rescue, as I found out. I also wanted to copy the drive over to another new NVMe drive, but that failed: it looks like the USB controller I have doesn't work with older, non-NVME drives.

Boot configuration Now we need to enter the new system to rebuild the boot loader and initrd and so on. First, we bind mounts and chroot into the ZFS disk:
mount --rbind /dev  /mnt/dev &&
mount --rbind /proc /mnt/proc &&
mount --rbind /sys  /mnt/sys &&
chroot /mnt /bin/bash
Next we add an extra service that imports the bpool on boot, to make sure it survives a zpool.cache destruction:
cat > /etc/systemd/system/zfs-import-bpool.service <<EOF
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
[Install]
WantedBy=zfs-import.target
EOF
Enable the service:
systemctl enable zfs-import-bpool.service
I had to trim down /etc/fstab and /etc/crypttab to only contain references to the legacy filesystems (/srv is still BTRFS!). If we don't already have a tmpfs defined in /etc/fstab:
ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ &&
systemctl enable tmp.mount
Rebuild boot loader with support for ZFS, but also to workaround GRUB's missing zpool-features support:
grub-probe /boot   grep -q zfs &&
update-initramfs -c -k all &&
sed -i 's,GRUB_CMDLINE_LINUX.*,GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian",' /etc/default/grub &&
update-grub
For good measure, make sure the right disk is configured here, for example you might want to tag both drives in a RAID array:
dpkg-reconfigure grub-pc
Install grub to EFI while you're there:
grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian --recheck --no-floppy
Filesystem mount ordering. The rationale here in the OpenZFS guide is a little strange, but I don't dare ignore that.
mkdir /etc/zfs/zfs-list.cache
touch /etc/zfs/zfs-list.cache/bpool
touch /etc/zfs/zfs-list.cache/rpool
zed -F &
Verify that zed updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
cat /etc/zfs/zfs-list.cache/rpool
Once the files have data, stop zed:
fg
Press Ctrl-C.
Fix the paths to eliminate /mnt:
sed -Ei "s /mnt/? / " /etc/zfs/zfs-list.cache/*
Snapshot initial install:
zfs snapshot bpool/BOOT/debian@install
zfs snapshot rpool/ROOT/debian@install
Exit chroot:
exit

Finalizing One last sync was done in rescue mode:
for fs in /boot/ /boot/efi/ / /home/; do
    echo "syncing $fs to /mnt$fs..." && 
    rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
done
Then we unmount all filesystems:
mount   grep -v zfs   tac   awk '/\/mnt/  print $3 '   xargs -i  umount -lf  
zpool export -a
Reboot, swap the drives, and boot in ZFS. Hurray!

Benchmarks This is a test that was ran in single-user mode using fio and the Ars Technica recommended tests, which are:
  • Single 4KiB random write process:
     fio --name=randwrite4k1x --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
    
  • 16 parallel 64KiB random write processes:
     fio --name=randwrite64k16x --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
    
  • Single 1MiB random write process:
     fio --name=randwrite1m1x --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
    
Strangely, that's not exactly what the author, Jim Salter, did in his actual test bench used in the ZFS benchmarking article. The first thing is there's no read test at all, which is already pretty strange. But also it doesn't include stuff like dropping caches or repeating results. So here's my variation, which i called fio-ars-bench.sh for now. It just batches a bunch of fio tests, one by one, 60 seconds each. It should take about 12 minutes to run, as there are 3 pair of tests, read/write, with and without async. My bias, before building, running and analysing those results is that ZFS should outperform the traditional stack on writes, but possibly not on reads. It's also possible it outperforms it on both, because it's a newer drive. A new test might be possible with a new external USB drive as well, although I doubt I will find the time to do this.

Results All tests were done on WD blue SN550 drives, which claims to be able to push 2400MB/s read and 1750MB/s write. An extra drive was bought to move the LVM setup from a WDC WDS500G1B0B-00AS40 SSD, a WD blue M.2 2280 SSD that was at least 5 years old, spec'd at 560MB/s read, 530MB/s write. Benchmarks were done on the M.2 SSD drive but discarded so that the drive difference is not a factor in the test. In practice, I'm going to assume we'll never reach those numbers because we're not actually NVMe (this is an old workstation!) so the bottleneck isn't the disk itself. For our purposes, it might still give us useful results.

Rescue test, LUKS/LVM/ext4 Those tests were performed with everything shutdown, after either entering the system in rescue mode, or by reaching that target with:
systemctl rescue
The network might have been started before or after the test as well:
systemctl start systemd-networkd
So it should be fairly reliable as basically nothing else is running. Raw numbers, from the ?job-curie-lvm.log, converted to MiB/s and manually merged:
test read I/O read IOPS write I/O write IOPS
rand4k4g1x 39.27 10052 212.15 54310
rand4k4g1x--fsync=1 39.29 10057 2.73 699
rand64k256m16x 1297.00 20751 1068.57 17097
rand64k256m16x--fsync=1 1290.90 20654 353.82 5661
rand1m16g1x 315.15 315 563.77 563
rand1m16g1x--fsync=1 345.88 345 157.01 157
Peaks are at about 20k IOPS and ~1.3GiB/s read, 1GiB/s write in the 64KB blocks with 16 jobs. Slowest is the random 4k block sync write at an abysmal 3MB/s and 700 IOPS The 1MB read/write tests have lower IOPS, but that is expected.

Rescue test, ZFS This test was also performed in rescue mode. Raw numbers, from the ?job-curie-zfs.log, converted to MiB/s and manually merged:
test read I/O read IOPS write I/O write IOPS
rand4k4g1x 77.20 19763 27.13 6944
rand4k4g1x--fsync=1 76.16 19495 6.53 1673
rand64k256m16x 1882.40 30118 70.58 1129
rand64k256m16x--fsync=1 1865.13 29842 71.98 1151
rand1m16g1x 921.62 921 102.21 102
rand1m16g1x--fsync=1 908.37 908 64.30 64
Peaks are at 1.8GiB/s read, also in the 64k job like above, but much faster. The write is, as expected, much slower at 70MiB/s (compared to 1GiB/s!), but it should be noted the sync write doesn't degrade performance compared to async writes (although it's still below the LVM 300MB/s).

Conclusions Really, ZFS has trouble performing in all write conditions. The random 4k sync write test is the only place where ZFS outperforms LVM in writes, and barely (7MiB/s vs 3MiB/s). Everywhere else, writes are much slower, sometimes by an order of magnitude. And before some ZFS zealot jumps in talking about the SLOG or some other cache that could be added to improved performance, I'll remind you that those numbers are on a bare bones NVMe drive, pretty much as fast storage as you can find on this machine. Adding another NVMe drive as a cache probably will not improve write performance here. Still, those are very different results than the tests performed by Salter which shows ZFS beating traditional configurations in all categories but uncached 4k reads (not writes!). That said, those tests are very different from the tests I performed here, where I test writes on a single disk, not a RAID array, which might explain the discrepancy. Also, note that neither LVM or ZFS manage to reach the 2400MB/s read and 1750MB/s write performance specification. ZFS does manage to reach 82% of the read performance (1973MB/s) and LVM 64% of the write performance (1120MB/s). LVM hits 57% of the read performance and ZFS hits barely 6% of the write performance. Overall, I'm a bit disappointed in the ZFS write performance here, I must say. Maybe I need to tweak the record size or some other ZFS voodoo, but I'll note that I didn't have to do any such configuration on the other side to kick ZFS in the pants...

Real world experience This section document not synthetic backups, but actual real world workloads, comparing before and after I switched my workstation to ZFS.

Docker performance I had the feeling that running some git hook (which was firing a Docker container) was "slower" somehow. It seems that, at runtime, ZFS backends are significant slower than their overlayfs/ext4 equivalent:
May 16 14:42:52 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87\x2dinit-merged.mount: Succeeded.
May 16 14:42:52 curie systemd[5161]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87\x2dinit-merged.mount: Succeeded.
May 16 14:42:52 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:53 curie dockerd[1723]: time="2022-05-16T14:42:53.087219426-04:00" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10 pid=151170
May 16 14:42:53 curie systemd[1]: Started libcontainer container af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10.
May 16 14:42:54 curie systemd[1]: docker-af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10.scope: Succeeded.
May 16 14:42:54 curie dockerd[1723]: time="2022-05-16T14:42:54.047297800-04:00" level=info msg="shim disconnected" id=af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10
May 16 14:42:54 curie dockerd[998]: time="2022-05-16T14:42:54.051365015-04:00" level=info msg="ignoring event" container=af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 16 14:42:54 curie systemd[2444]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[5161]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[2444]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:54 curie systemd[5161]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:54 curie systemd[1]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
Translating this:
  • container setup: ~1 second
  • container runtime: ~1 second
  • container teardown: ~1 second
  • total runtime: 2-3 seconds
Obviously, those timestamps are not quite accurate enough to make precise measurements... After I switched to ZFS:
mai 30 15:31:39 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf\x2dinit.mount: Succeeded. 
mai 30 15:31:39 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf\x2dinit.mount: Succeeded. 
mai 30 15:31:40 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:40 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:41 curie dockerd[3199]: time="2022-05-30T15:31:41.551403693-04:00" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 pid=141080 
mai 30 15:31:41 curie systemd[1]: run-docker-runtime\x2drunc-moby-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142-runc.ZVcjvl.mount: Succeeded. 
mai 30 15:31:41 curie systemd[5287]: run-docker-runtime\x2drunc-moby-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142-runc.ZVcjvl.mount: Succeeded. 
mai 30 15:31:41 curie systemd[1]: Started libcontainer container 42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142. 
mai 30 15:31:45 curie systemd[1]: docker-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142.scope: Succeeded. 
mai 30 15:31:45 curie dockerd[3199]: time="2022-05-30T15:31:45.883019128-04:00" level=info msg="shim disconnected" id=42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 
mai 30 15:31:45 curie dockerd[1726]: time="2022-05-30T15:31:45.883064491-04:00" level=info msg="ignoring event" container=42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete" 
mai 30 15:31:45 curie systemd[1]: run-docker-netns-e45f5cf5f465.mount: Succeeded. 
mai 30 15:31:45 curie systemd[5287]: run-docker-netns-e45f5cf5f465.mount: Succeeded. 
mai 30 15:31:45 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:45 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded.
That's double or triple the run time, from 2 seconds to 6 seconds. Most of the time is spent in run time, inside the container. Here's the breakdown:
  • container setup: ~2 seconds
  • container run: ~4 seconds
  • container teardown: ~1 second
  • total run time: about ~6-7 seconds
That's a two- to three-fold increase! Clearly something is going on here that I should tweak. It's possible that code path is less optimized in Docker. I also worry about podman, but apparently it also supports ZFS backends. Possibly it would perform better, but at this stage I wouldn't have a good comparison: maybe it would have performed better on non-ZFS as well...

Interactivity While doing the offsite backups (below), the system became somewhat "sluggish". I felt everything was slow, and I estimate it introduced ~50ms latency in any input device. Arguably, those are all USB and the external drive was connected through USB, but I suspect the ZFS drivers are not as well tuned with the scheduler as the regular filesystem drivers...

Recovery procedures For test purposes, I unmounted all systems during the procedure:
umount /mnt/boot/efi /mnt/boot/run
umount -a -t zfs
zpool export -a
And disconnected the drive, to see how I would recover this system from another Linux system in case of a total motherboard failure. To import an existing pool, plug the device, then import the pool with an alternate root, so it doesn't mount over your existing filesystems, then you mount the root filesystem and all the others:
zpool import -l -a -R /mnt &&
zfs mount rpool/ROOT/debian &&
zfs mount -a &&
mount /dev/sdc2 /mnt/boot/efi &&
mount -t tmpfs tmpfs /mnt/run &&
mkdir /mnt/run/lock

Offsite backup Part of the goal of using ZFS is to simplify and harden backups. I wanted to experiment with shorter recovery times specifically both point in time recovery objective and recovery time objective and faster incremental backups. This is, therefore, part of my backup services. This section documents how an external NVMe enclosure was setup in a pool to mirror the datasets from my workstation. The final setup should include syncoid copying datasets to the backup server regularly, but I haven't finished that configuration yet.

Partitioning The above partitioning procedure used sgdisk, but I couldn't figure out how to do this with sgdisk, so this uses sfdisk to dump the partition from the first disk to an external, identical drive:
sfdisk -d /dev/nvme0n1   sfdisk --no-reread /dev/sda --force

Pool creation This is similar to the main pool creation, except we tweaked a few bits after changing the upstream procedure:
zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O xattr=sa \
        -O compression=lz4 \
        -O devices=off \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/boot -R /mnt \
        bpool-tubman /dev/sdb3
The change from the main boot pool are: Main pool creation is:
zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/ -R /mnt \
        rpool-tubman /dev/sdb4

First sync I used syncoid to copy all pools over to the external device. syncoid is a thing that's part of the sanoid project which is specifically designed to sync snapshots between pool, typically over SSH links but it can also operate locally. The sanoid command had a --readonly argument to simulate changes, but syncoid didn't so I tried to fix that with an upstream PR. It seems it would be better to do this by hand, but this was much easier. The full first sync was:
root@curie:/home/anarcat# ./bin/syncoid -r  bpool bpool-tubman
CRITICAL ERROR: Target bpool-tubman exists but has no snapshots matching with bpool!
                Replication to target would require destroying existing
                target. Cowardly refusing to destroy your existing target.
          NOTE: Target bpool-tubman dataset is < 64MB used - did you mistakenly run
                 zfs create bpool-tubman  on the target? ZFS initial
                replication must be to a NON EXISTENT DATASET, which will
                then be CREATED BY the initial replication process.
INFO: Sending oldest full snapshot bpool/BOOT@test (~ 42 KB) to new target filesystem:
44.2KiB 0:00:00 [4.19MiB/s] [========================================================================================================================] 103%            
INFO: Updating new target filesystem with incremental bpool/BOOT@test ... syncoid_curie_2022-05-30:12:50:39 (~ 4 KB):
2.13KiB 0:00:00 [ 114KiB/s] [===============================================================>                                                         ] 53%            
INFO: Sending oldest full snapshot bpool/BOOT/debian@install (~ 126.0 MB) to new target filesystem:
 126MiB 0:00:00 [ 308MiB/s] [=======================================================================================================================>] 100%            
INFO: Updating new target filesystem with incremental bpool/BOOT/debian@install ... syncoid_curie_2022-05-30:12:50:39 (~ 113.4 MB):
 113MiB 0:00:00 [ 315MiB/s] [=======================================================================================================================>] 100%
root@curie:/home/anarcat# ./bin/syncoid -r  rpool rpool-tubman
CRITICAL ERROR: Target rpool-tubman exists but has no snapshots matching with rpool!
                Replication to target would require destroying existing
                target. Cowardly refusing to destroy your existing target.
          NOTE: Target rpool-tubman dataset is < 64MB used - did you mistakenly run
                 zfs create rpool-tubman  on the target? ZFS initial
                replication must be to a NON EXISTENT DATASET, which will
                then be CREATED BY the initial replication process.
INFO: Sending oldest full snapshot rpool/ROOT@syncoid_curie_2022-05-30:12:50:51 (~ 69 KB) to new target filesystem:
44.2KiB 0:00:00 [2.44MiB/s] [===========================================================================>                                             ] 63%            
INFO: Sending oldest full snapshot rpool/ROOT/debian@install (~ 25.9 GB) to new target filesystem:
25.9GiB 0:03:33 [ 124MiB/s] [=======================================================================================================================>] 100%            
INFO: Updating new target filesystem with incremental rpool/ROOT/debian@install ... syncoid_curie_2022-05-30:12:50:52 (~ 3.9 GB):
3.92GiB 0:00:33 [ 119MiB/s] [======================================================================================================================>  ] 99%            
INFO: Sending oldest full snapshot rpool/home@syncoid_curie_2022-05-30:12:55:04 (~ 276.8 GB) to new target filesystem:
 277GiB 0:27:13 [ 174MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/home/root@syncoid_curie_2022-05-30:13:22:19 (~ 2.2 GB) to new target filesystem:
2.22GiB 0:00:25 [90.2MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var@syncoid_curie_2022-05-30:13:22:47 (~ 5.6 GB) to new target filesystem:
5.56GiB 0:00:32 [ 176MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/cache@syncoid_curie_2022-05-30:13:23:22 (~ 627.3 MB) to new target filesystem:
 627MiB 0:00:03 [ 169MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/lib@syncoid_curie_2022-05-30:13:23:28 (~ 69 KB) to new target filesystem:
44.2KiB 0:00:00 [1.40MiB/s] [===========================================================================>                                             ] 63%            
INFO: Sending oldest full snapshot rpool/var/lib/docker@syncoid_curie_2022-05-30:13:23:28 (~ 442.6 MB) to new target filesystem:
 443MiB 0:00:04 [ 103MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254 (~ 6.3 MB) to new target filesystem:
6.49MiB 0:00:00 [12.9MiB/s] [========================================================================================================================] 102%            
INFO: Updating new target filesystem with incremental rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254 ... syncoid_curie_2022-05-30:13:23:34 (~ 4 KB):
1.52KiB 0:00:00 [27.6KiB/s] [============================================>                                                                            ] 38%            
INFO: Sending oldest full snapshot rpool/var/lib/flatpak@syncoid_curie_2022-05-30:13:23:36 (~ 2.0 GB) to new target filesystem:
2.00GiB 0:00:17 [ 115MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/tmp@syncoid_curie_2022-05-30:13:23:55 (~ 57.0 MB) to new target filesystem:
61.8MiB 0:00:01 [45.0MiB/s] [========================================================================================================================] 108%            
INFO: Clone is recreated on target rpool-tubman/var/lib/docker/ed71ddd563a779ba6fb37b3b1d0cc2c11eca9b594e77b4b234867ebcb162b205 based on rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254
INFO: Sending oldest full snapshot rpool/var/lib/docker/ed71ddd563a779ba6fb37b3b1d0cc2c11eca9b594e77b4b234867ebcb162b205@syncoid_curie_2022-05-30:13:23:58 (~ 218.6 MB) to new target filesystem:
 219MiB 0:00:01 [ 151MiB/s] [=======================================================================================================================>] 100%
Funny how the CRITICAL ERROR doesn't actually stop syncoid and it just carries on merrily doing when it's telling you it's "cowardly refusing to destroy your existing target"... Maybe that's because my pull request broke something though... During the transfer, the computer was very sluggish: everything feels like it has ~30-50ms latency extra:
anarcat@curie:sanoid$ LANG=C top -b  -n 1   head -20
top - 13:07:05 up 6 days,  4:01,  1 user,  load average: 16.13, 16.55, 11.83
Tasks: 606 total,   6 running, 598 sleeping,   0 stopped,   2 zombie
%Cpu(s): 18.8 us, 72.5 sy,  1.2 ni,  5.0 id,  1.2 wa,  0.0 hi,  1.2 si,  0.0 st
MiB Mem :  15898.4 total,   1387.6 free,  13170.0 used,   1340.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1319.8 avail Mem 
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     70 root      20   0       0      0      0 S  83.3   0.0   6:12.67 kswapd0
4024878 root      20   0  282644  96432  10288 S  44.4   0.6   0:11.43 puppet
3896136 root      20   0   35328  16528     48 S  22.2   0.1   2:08.04 mbuffer
3896135 root      20   0   10328    776    168 R  16.7   0.0   1:22.93 zfs
3896138 root      20   0   10588    788    156 R  16.7   0.0   1:49.30 zfs
    350 root       0 -20       0      0      0 R  11.1   0.0   1:03.53 z_rd_int
    351 root       0 -20       0      0      0 S  11.1   0.0   1:04.15 z_rd_int
3896137 root      20   0    4384    352    244 R  11.1   0.0   0:44.73 pv
4034094 anarcat   30  10   20028  13960   2428 S  11.1   0.1   0:00.70 mbsync
4036539 anarcat   20   0    9604   3464   2408 R  11.1   0.0   0:00.04 top
    352 root       0 -20       0      0      0 S   5.6   0.0   1:03.64 z_rd_int
    353 root       0 -20       0      0      0 S   5.6   0.0   1:03.64 z_rd_int
    354 root       0 -20       0      0      0 S   5.6   0.0   1:04.01 z_rd_int
I wonder how much of that is due to syncoid, particularly because I often saw mbuffer and pv in there which are not strictly necessary to do those kind of operations, as far as I understand. Once that's done, export the pools to disconnect the drive:
zpool export bpool-tubman
zpool export rpool-tubman

Raw disk benchmark Copied the 512GB SSD/M.2 device to another 1024GB NVMe/M.2 device:
anarcat@curie:~$ sudo dd if=/dev/sdb of=/dev/sdc bs=4M status=progress conv=fdatasync
499944259584 octets (500 GB, 466 GiB) copi s, 1713 s, 292 MB/s
119235+1 enregistrements lus
119235+1 enregistrements  crits
500107862016 octets (500 GB, 466 GiB) copi s, 1719,93 s, 291 MB/s
... while both over USB, whoohoo 300MB/s!

Monitoring ZFS should be monitoring your pools regularly. Normally, the [[!debman zed]] daemon monitors all ZFS events. It is the thing that will report when a scrub failed, for example. See this configuration guide. Scrubs should be regularly scheduled to ensure consistency of the pool. This can be done in newer zfsutils-linux versions (bullseye-backports or bookworm) with one of those, depending on the desired frequency:
systemctl enable zfs-scrub-weekly@rpool.timer --now
systemctl enable zfs-scrub-monthly@rpool.timer --now
When the scrub runs, if it finds anything it will send an event which will get picked up by the zed daemon which will then send a notification, see below for an example. TODO: deploy on curie, if possible (probably not because no RAID) TODO: this should be in Puppet

Scrub warning example So what happens when problems are found? Here's an example of how I dealt with an error I received. After setting up another server (tubman) with ZFS, I eventually ended up getting a warning from the ZFS toolchain.
Date: Sun, 09 Oct 2022 00:58:08 -0400
From: root <root@anarc.at>
To: root@anarc.at
Subject: ZFS scrub_finish event for rpool on tubman
ZFS has finished a scrub:
   eid: 39536
 class: scrub_finish
  host: tubman
  time: 2022-10-09 00:58:07-0400
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:33:57 with 0 errors on Sun Oct  9 00:58:07 2022
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb4    ONLINE       0     1     0
            sdc4    ONLINE       0     0     0
        cache
          sda3      ONLINE       0     0     0
errors: No known data errors
This, in itself, is a little worrisome. But it helpfully links to this more detailed documentation (and props up there: the link still works) which explains this is a "minor" problem (something that could be included in the report). In this case, this happened on a server setup on 2021-04-28, but the disks and server hardware are much older. The server itself (marcos v1) was built around 2011, over 10 years ago now. The hard drive in question is:
root@tubman:~# smartctl -i -qnoserial /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 11 11:02:32 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Some more SMART stats:
root@tubman:~# smartctl -a -qnoserial /dev/sdb   grep -e  Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12464 (206 202 0)
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10966h+55m+23.757s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       21107792664
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3201579750
That's over a year of power on, which shouldn't be so bad. It has written about 10TB of data (21107792664 LBAs * 512 byte/LBA), which is about two full writes. According to its specification, this device is supposed to support 55 TB/year of writes, so we're far below spec. Note that are still far from the "non-recoverable read error per bits" spec (1 per 10E15), as we've basically read 13E12 bits (3201579750 LBAs * 512 byte/LBA = 13E12 bits). It's likely this disk was made in 2018, so it is in its fourth year. Interestingly, /dev/sdc is also a Seagate drive, but of a different series:
root@tubman:~# smartctl -qnoserial  -i /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 11 11:21:35 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
It has seen much more reads than the other disk which is also interesting:
root@tubman:~# smartctl -a -qnoserial /dev/sdc   grep -e  Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       36240
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       33994h+10m+52.118s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       30730174438
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       51894566538
That's 4 years of Head_Flying_Hours, and over 4 years (4 years and 48 days) of Power_On_Hours. The copyright date on that drive's specs goes back to 2016, so it's a much older drive. SMART self-test succeeded.

Remaining issues
  • TODO: move send/receive backups to offsite host, see also zfs for alternatives to syncoid/sanoid there
  • TODO: setup backup cron job (or timer?)
  • TODO: swap still not setup on curie, see zfs
  • TODO: document this somewhere: bpool and rpool are both pools and datasets. that's pretty confusing, but also very useful because it allows for pool-wide recursive snapshots, which are used for the backup system

fio improvements I really want to improve my experience with fio. Right now, I'm just cargo-culting stuff from other folks and I don't really like it. stressant is a good example of my struggles, in the sense that it doesn't really work that well for disk tests. I would love to have just a single .fio job file that lists multiple jobs to run serially. For example, this file describes the above workload pretty well:
[global]
# cargo-culting Salter
fallocate=none
ioengine=posixaio
runtime=60
time_based=1
end_fsync=1
stonewall=1
group_reporting=1
# no need to drop caches, done by default
# invalidate=1
# Single 4KiB random read/write process
[randread-4k-4g-1x]
rw=randread
bs=4k
size=4g
numjobs=1
iodepth=1
[randwrite-4k-4g-1x]
rw=randwrite
bs=4k
size=4g
numjobs=1
iodepth=1
# 16 parallel 64KiB random read/write processes:
[randread-64k-256m-16x]
rw=randread
bs=64k
size=256m
numjobs=16
iodepth=16
[randwrite-64k-256m-16x]
rw=randwrite
bs=64k
size=256m
numjobs=16
iodepth=16
# Single 1MiB random read/write process
[randread-1m-16g-1x]
rw=randread
bs=1m
size=16g
numjobs=1
iodepth=1
[randwrite-1m-16g-1x]
rw=randwrite
bs=1m
size=16g
numjobs=1
iodepth=1
... except the jobs are actually started in parallel, even though they are stonewall'd, as far as I can tell by the reports. I sent a mail to the fio mailing list for clarification. It looks like the jobs are started in parallel, but actual (correctly) run serially. It seems like this might just be a matter of reporting the right timestamps in the end, although it does feel like starting all the processes (even if not doing any work yet) could skew the results.

Hangs during procedure During the procedure, it happened a few times where any ZFS command would completely hang. It seems that using an external USB drive to sync stuff didn't work so well: sometimes it would reconnect under a different device (from sdc to sdd, for example), and this would greatly confuse ZFS. Here, for example, is sdd reappearing out of the blue:
May 19 11:22:53 curie kernel: [  699.820301] scsi host4: uas
May 19 11:22:53 curie kernel: [  699.820544] usb 2-1: authorized to connect
May 19 11:22:53 curie kernel: [  699.922433] scsi 4:0:0:0: Direct-Access     ROG      ESD-S1C          0    PQ: 0 ANSI: 6
May 19 11:22:53 curie kernel: [  699.923235] sd 4:0:0:0: Attached scsi generic sg2 type 0
May 19 11:22:53 curie kernel: [  699.923676] sd 4:0:0:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
May 19 11:22:53 curie kernel: [  699.923788] sd 4:0:0:0: [sdd] Write Protect is off
May 19 11:22:53 curie kernel: [  699.923949] sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 19 11:22:53 curie kernel: [  699.924149] sd 4:0:0:0: [sdd] Optimal transfer size 33553920 bytes
May 19 11:22:53 curie kernel: [  699.961602]  sdd: sdd1 sdd2 sdd3 sdd4
May 19 11:22:53 curie kernel: [  699.996083] sd 4:0:0:0: [sdd] Attached SCSI disk
Next time I run a ZFS command (say zpool list), the command completely hangs (D state) and this comes up in the logs:
May 19 11:34:21 curie kernel: [ 1387.914843] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=71344128 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914859] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=205565952 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914874] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=272789504 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914906] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.914932] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=1073225728 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.914948] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=1073487872 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.915165] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=272793600 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.915183] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=339853312 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.915648] WARNING: Pool 'bpool' has encountered an uncorrectable I/O failure and has been suspended.
May 19 11:34:21 curie kernel: [ 1387.915648] 
May 19 11:37:25 curie kernel: [ 1571.558614] task:txg_sync        state:D stack:    0 pid:  997 ppid:     2 flags:0x00004000
May 19 11:37:25 curie kernel: [ 1571.558623] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.558640]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.558650]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.558670]  schedule_timeout+0x8b/0x140
May 19 11:37:25 curie kernel: [ 1571.558675]  ? __next_timer_interrupt+0x110/0x110
May 19 11:37:25 curie kernel: [ 1571.558678]  io_schedule_timeout+0x4c/0x80
May 19 11:37:25 curie kernel: [ 1571.558689]  __cv_timedwait_common+0x12b/0x160 [spl]
May 19 11:37:25 curie kernel: [ 1571.558694]  ? add_wait_queue_exclusive+0x70/0x70
May 19 11:37:25 curie kernel: [ 1571.558702]  __cv_timedwait_io+0x15/0x20 [spl]
May 19 11:37:25 curie kernel: [ 1571.558816]  zio_wait+0x129/0x2b0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.558929]  dsl_pool_sync+0x461/0x4f0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559032]  spa_sync+0x575/0xfa0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559138]  ? spa_txg_history_init_io+0x101/0x110 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559245]  txg_sync_thread+0x2e0/0x4a0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559354]  ? txg_fini+0x240/0x240 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559366]  thread_generic_wrapper+0x6f/0x80 [spl]
May 19 11:37:25 curie kernel: [ 1571.559376]  ? __thread_exit+0x20/0x20 [spl]
May 19 11:37:25 curie kernel: [ 1571.559379]  kthread+0x11b/0x140
May 19 11:37:25 curie kernel: [ 1571.559382]  ? __kthread_bind_mask+0x60/0x60
May 19 11:37:25 curie kernel: [ 1571.559386]  ret_from_fork+0x22/0x30
May 19 11:37:25 curie kernel: [ 1571.559401] task:zed             state:D stack:    0 pid: 1564 ppid:     1 flags:0x00000000
May 19 11:37:25 curie kernel: [ 1571.559404] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.559409]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.559412]  ? __kmalloc_node+0x141/0x2b0
May 19 11:37:25 curie kernel: [ 1571.559417]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.559420]  schedule_preempt_disabled+0xa/0x10
May 19 11:37:25 curie kernel: [ 1571.559424]  __mutex_lock.constprop.0+0x133/0x460
May 19 11:37:25 curie kernel: [ 1571.559435]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
May 19 11:37:25 curie kernel: [ 1571.559537]  spa_all_configs+0x41/0x120 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559644]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559752]  zfsdev_ioctl_common+0x697/0x870 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559758]  ? _copy_from_user+0x28/0x60
May 19 11:37:25 curie kernel: [ 1571.559860]  zfsdev_ioctl+0x53/0xe0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559866]  __x64_sys_ioctl+0x83/0xb0
May 19 11:37:25 curie kernel: [ 1571.559869]  do_syscall_64+0x33/0x80
May 19 11:37:25 curie kernel: [ 1571.559873]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 19 11:37:25 curie kernel: [ 1571.559876] RIP: 0033:0x7fcf0ef32cc7
May 19 11:37:25 curie kernel: [ 1571.559878] RSP: 002b:00007fcf0e181618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 19 11:37:25 curie kernel: [ 1571.559881] RAX: ffffffffffffffda RBX: 000055b212f972a0 RCX: 00007fcf0ef32cc7
May 19 11:37:25 curie kernel: [ 1571.559883] RDX: 00007fcf0e181640 RSI: 0000000000005a04 RDI: 000000000000000b
May 19 11:37:25 curie kernel: [ 1571.559885] RBP: 00007fcf0e184c30 R08: 00007fcf08016810 R09: 00007fcf08000080
May 19 11:37:25 curie kernel: [ 1571.559886] R10: 0000000000080000 R11: 0000000000000246 R12: 000055b212f972a0
May 19 11:37:25 curie kernel: [ 1571.559888] R13: 0000000000000000 R14: 00007fcf0e181640 R15: 0000000000000000
May 19 11:37:25 curie kernel: [ 1571.559980] task:zpool           state:D stack:    0 pid:11815 ppid:  3816 flags:0x00004000
May 19 11:37:25 curie kernel: [ 1571.559983] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.559988]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.559992]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.559995]  io_schedule+0x42/0x70
May 19 11:37:25 curie kernel: [ 1571.560004]  cv_wait_common+0xac/0x130 [spl]
May 19 11:37:25 curie kernel: [ 1571.560008]  ? add_wait_queue_exclusive+0x70/0x70
May 19 11:37:25 curie kernel: [ 1571.560118]  txg_wait_synced_impl+0xc9/0x110 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560223]  txg_wait_synced+0xc/0x40 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560325]  spa_export_common+0x4cd/0x590 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560430]  ? zfs_log_history+0x9c/0xf0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560537]  zfsdev_ioctl_common+0x697/0x870 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560543]  ? _copy_from_user+0x28/0x60
May 19 11:37:25 curie kernel: [ 1571.560644]  zfsdev_ioctl+0x53/0xe0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560649]  __x64_sys_ioctl+0x83/0xb0
May 19 11:37:25 curie kernel: [ 1571.560653]  do_syscall_64+0x33/0x80
May 19 11:37:25 curie kernel: [ 1571.560656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 19 11:37:25 curie kernel: [ 1571.560659] RIP: 0033:0x7fdc23be2cc7
May 19 11:37:25 curie kernel: [ 1571.560661] RSP: 002b:00007ffc8c792478 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 19 11:37:25 curie kernel: [ 1571.560664] RAX: ffffffffffffffda RBX: 000055942ca49e20 RCX: 00007fdc23be2cc7
May 19 11:37:25 curie kernel: [ 1571.560666] RDX: 00007ffc8c792490 RSI: 0000000000005a03 RDI: 0000000000000003
May 19 11:37:25 curie kernel: [ 1571.560667] RBP: 00007ffc8c795e80 R08: 00000000ffffffff R09: 00007ffc8c792310
May 19 11:37:25 curie kernel: [ 1571.560669] R10: 000055942ca49e30 R11: 0000000000000246 R12: 00007ffc8c792490
May 19 11:37:25 curie kernel: [ 1571.560671] R13: 000055942ca49e30 R14: 000055942aed2c20 R15: 00007ffc8c795a40
Here's another example, where you see the USB controller bleeping out and back into existence:
mai 19 11:38:39 curie kernel: usb 2-1: USB disconnect, device number 2
mai 19 11:38:39 curie kernel: sd 4:0:0:0: [sdd] Synchronizing SCSI cache
mai 19 11:38:39 curie kernel: sd 4:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
mai 19 11:39:25 curie kernel: INFO: task zed:1564 blocked for more than 241 seconds.
mai 19 11:39:25 curie kernel:       Tainted: P          IOE     5.10.0-14-amd64 #1 Debian 5.10.113-1
mai 19 11:39:25 curie kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mai 19 11:39:25 curie kernel: task:zed             state:D stack:    0 pid: 1564 ppid:     1 flags:0x00000000
mai 19 11:39:25 curie kernel: Call Trace:
mai 19 11:39:25 curie kernel:  __schedule+0x282/0x870
mai 19 11:39:25 curie kernel:  ? __kmalloc_node+0x141/0x2b0
mai 19 11:39:25 curie kernel:  schedule+0x46/0xb0
mai 19 11:39:25 curie kernel:  schedule_preempt_disabled+0xa/0x10
mai 19 11:39:25 curie kernel:  __mutex_lock.constprop.0+0x133/0x460
mai 19 11:39:25 curie kernel:  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
mai 19 11:39:25 curie kernel:  spa_all_configs+0x41/0x120 [zfs]
mai 19 11:39:25 curie kernel:  zfs_ioc_pool_configs+0x17/0x70 [zfs]
mai 19 11:39:25 curie kernel:  zfsdev_ioctl_common+0x697/0x870 [zfs]
mai 19 11:39:25 curie kernel:  ? _copy_from_user+0x28/0x60
mai 19 11:39:25 curie kernel:  zfsdev_ioctl+0x53/0xe0 [zfs]
mai 19 11:39:25 curie kernel:  __x64_sys_ioctl+0x83/0xb0
mai 19 11:39:25 curie kernel:  do_syscall_64+0x33/0x80
mai 19 11:39:25 curie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
mai 19 11:39:25 curie kernel: RIP: 0033:0x7fcf0ef32cc7
mai 19 11:39:25 curie kernel: RSP: 002b:00007fcf0e181618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 19 11:39:25 curie kernel: RAX: ffffffffffffffda RBX: 000055b212f972a0 RCX: 00007fcf0ef32cc7
mai 19 11:39:25 curie kernel: RDX: 00007fcf0e181640 RSI: 0000000000005a04 RDI: 000000000000000b
mai 19 11:39:25 curie kernel: RBP: 00007fcf0e184c30 R08: 00007fcf08016810 R09: 00007fcf08000080
mai 19 11:39:25 curie kernel: R10: 0000000000080000 R11: 0000000000000246 R12: 000055b212f972a0
mai 19 11:39:25 curie kernel: R13: 0000000000000000 R14: 00007fcf0e181640 R15: 0000000000000000
mai 19 11:39:25 curie kernel: INFO: task zpool:11815 blocked for more than 241 seconds.
mai 19 11:39:25 curie kernel:       Tainted: P          IOE     5.10.0-14-amd64 #1 Debian 5.10.113-1
mai 19 11:39:25 curie kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mai 19 11:39:25 curie kernel: task:zpool           state:D stack:    0 pid:11815 ppid:  2621 flags:0x00004004
mai 19 11:39:25 curie kernel: Call Trace:
mai 19 11:39:25 curie kernel:  __schedule+0x282/0x870
mai 19 11:39:25 curie kernel:  schedule+0x46/0xb0
mai 19 11:39:25 curie kernel:  io_schedule+0x42/0x70
mai 19 11:39:25 curie kernel:  cv_wait_common+0xac/0x130 [spl]
mai 19 11:39:25 curie kernel:  ? add_wait_queue_exclusive+0x70/0x70
mai 19 11:39:25 curie kernel:  txg_wait_synced_impl+0xc9/0x110 [zfs]
mai 19 11:39:25 curie kernel:  txg_wait_synced+0xc/0x40 [zfs]
mai 19 11:39:25 curie kernel:  spa_export_common+0x4cd/0x590 [zfs]
mai 19 11:39:25 curie kernel:  ? zfs_log_history+0x9c/0xf0 [zfs]
mai 19 11:39:25 curie kernel:  zfsdev_ioctl_common+0x697/0x870 [zfs]
mai 19 11:39:25 curie kernel:  ? _copy_from_user+0x28/0x60
mai 19 11:39:25 curie kernel:  zfsdev_ioctl+0x53/0xe0 [zfs]
mai 19 11:39:25 curie kernel:  __x64_sys_ioctl+0x83/0xb0
mai 19 11:39:25 curie kernel:  do_syscall_64+0x33/0x80
mai 19 11:39:25 curie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
mai 19 11:39:25 curie kernel: RIP: 0033:0x7fdc23be2cc7
mai 19 11:39:25 curie kernel: RSP: 002b:00007ffc8c792478 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 19 11:39:25 curie kernel: RAX: ffffffffffffffda RBX: 000055942ca49e20 RCX: 00007fdc23be2cc7
mai 19 11:39:25 curie kernel: RDX: 00007ffc8c792490 RSI: 0000000000005a03 RDI: 0000000000000003
mai 19 11:39:25 curie kernel: RBP: 00007ffc8c795e80 R08: 00000000ffffffff R09: 00007ffc8c792310
mai 19 11:39:25 curie kernel: R10: 000055942ca49e30 R11: 0000000000000246 R12: 00007ffc8c792490
mai 19 11:39:25 curie kernel: R13: 000055942ca49e30 R14: 000055942aed2c20 R15: 00007ffc8c795a40
I understand those are rather extreme conditions: I would fully expect the pool to stop working if the underlying drives disappear. What doesn't seem acceptable is that a command would completely hang like this.

References See the zfs documentation for more information about ZFS, and tubman for another installation and migration procedure.

3 November 2022

Alastair McKinstry: New Planning and Environment Court will reform planning appeals pro...

New Planning and Environment Court will reform planning appeals process #GreensInGovernment The Green Party has fulfilled an important Programme for Government commitment following the cabinet's decision to establish a new division of the High Court to specialise in environmental and planning issues. This is a major reform that will allow planning law to operate in a more efficient and environmentally friendly manner. Steven Matthews TD, Green Party Spokesperson for Planning and Local Government, explained the aim of the new court; The goal is to ensure that our judicial system has the capacity to decide cases as quickly as possible within a growing and increasingly complex body of Irish and EU environmental law. Timely access to justice is a cornersto ne of the effective rule of law and an international legal commitment Ireland has entered under the Aarhus Convention. This is extremely pressing given the urgent need to ease the pressure on the housing system and develop the infrastructure we need to transition to a zero-carbon economy." The Green Party is committed to ensuring the new Court will have the resources to fulfil its role. The Government will pass new legislation to allow the number of judges to be increased and provide the necessary exchequer funding to pay for these judges. The new Court will hear all the cases currently taken by the High Court s Commercial Planning and Strategic Infrastructure Development List and other major environmental cases. This is likely to include all major infrastructure cases as well as many cases relating to EU Environmental Law such as Environmental Impact Assessment, Strategic Environmental Assessment, Birds/Habitats, the Water Framework Directive and the Industrial Emissions Directive.

29 October 2022

Dirk Eddelbuettel: littler 0.3.17 on CRAN: Maintenance

max-heap image The eighteenth release of littler as a CRAN package just landed, following in the now sixteen year history (!!) as a package started by Jeff in 2006, and joined by me a few weeks later. littler is the first command-line interface for R as it predates Rscript. It allows for piping as well for shebang scripting via #!, uses command-line arguments more consistently and still starts faster. It also always loaded the methods package which Rscript only started to do in recent years. littler lives on Linux and Unix, has its difficulties on macOS due to yet-another-braindeadedness there (who ever thought case-insensitive filesystems as a default were a good idea?) and simply does not exist on Windows (yet the build system could be extended see RInside for an existence proof, and volunteers are welcome!). See the FAQ vignette on how to add it to your PATH. A few examples are highlighted at the Github repo, as well as in the examples vignette. This release, coming just a few weeks since the last release in August, heeds to clang-15 a updates one signature to a proper interface. It also contains one kindly contributed patch updating install2.r (and installBioc.r) to cope with a change in R-devel. The full change description follows.

Changes in littler version 0.3.17 (2022-10-29)
  • Changes in package
    • An internal function prototype was updated for clang-15.
  • Changes in examples
    • The install2.r and installBioc. were updated for an update in R-devel (Tatsuya Shima and Dirk in #104).

My CRANberries service provides a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page, and also on the package docs website. The code is available via the GitHub repo, from tarballs and now of course also from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter. Comments and suggestions are welcome at the GitHub repo. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

29 September 2022

Antoine Beaupr : Detecting manual (and optimizing large) package installs in Puppet

Well this is a mouthful. I recently worked on a neat hack called puppet-package-check. It is designed to warn about manually installed packages, to make sure "everything is in Puppet". But it turns out it can (probably?) dramatically decrease the bootstrap time of Puppet bootstrap when it needs to install a large number of packages.

Detecting manual packages On a cleanly filed workstation, it looks like this:
root@emma:/home/anarcat/bin# ./puppet-package-check -v
listing puppet packages...
listing apt packages...
loading apt cache...
0 unmanaged packages found
A messy workstation will look like this:
root@curie:/home/anarcat/bin# ./puppet-package-check -v
listing puppet packages...
listing apt packages...
loading apt cache...
288 unmanaged packages found
apparmor-utils beignet-opencl-icd bridge-utils clustershell cups-pk-helper davfs2 dconf-cli dconf-editor dconf-gsettings-backend ddccontrol ddrescueview debmake debootstrap decopy dict-devil dict-freedict-eng-fra dict-freedict-eng-spa dict-freedict-fra-eng dict-freedict-spa-eng diffoscope dnsdiag dropbear-initramfs ebtables efibootmgr elpa-lua-mode entr eog evince figlet file file-roller fio flac flex font-manager fonts-cantarell fonts-inconsolata fonts-ipafont-gothic fonts-ipafont-mincho fonts-liberation fonts-monoid fonts-monoid-tight fonts-noto fonts-powerline fonts-symbola freeipmi freetype2-demos ftp fwupd-amd64-signed gallery-dl gcc-arm-linux-gnueabihf gcolor3 gcp gdisk gdm3 gdu gedit gedit-plugins gettext-base git-debrebase gnome-boxes gnote gnupg2 golang-any golang-docker-credential-helpers golang-golang-x-tools grub-efi-amd64-signed gsettings-desktop-schemas gsfonts gstreamer1.0-libav gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-plugins-ugly gstreamer1.0-pulseaudio gtypist gvfs-backends hackrf hashcat html2text httpie httping hugo humanfriendly iamerican-huge ibus ibus-gtk3 ibus-libpinyin ibus-pinyin im-config imediff img2pdf imv initramfs-tools input-utils installation-birthday internetarchive ipmitool iptables iptraf-ng jackd2 jupyter jupyter-nbextension-jupyter-js-widgets jupyter-qtconsole k3b kbtin kdialog keditbookmarks keepassxc kexec-tools keyboard-configuration kfind konsole krb5-locales kwin-x11 leiningen lightdm lintian linux-image-amd64 linux-perf lmodern lsb-base lvm2 lynx lz4json magic-wormhole mailscripts mailutils manuskript mat2 mate-notification-daemon mate-themes mime-support mktorrent mp3splt mpdris2 msitools mtp-tools mtree-netbsd mupdf nautilus nautilus-sendto ncal nd ndisc6 neomutt net-tools nethogs nghttp2-client nocache npm2deb ntfs-3g ntpdate nvme-cli nwipe obs-studio okular-extra-backends openstack-clients openstack-pkg-tools paprefs pass-extension-audit pcmanfm pdf-presenter-console pdf2svg percol pipenv playerctl plymouth plymouth-themes popularity-contest progress prometheus-node-exporter psensor pubpaste pulseaudio python3-ldap qjackctl qpdfview qrencode r-cran-ggplot2 r-cran-reshape2 rake restic rhash rpl rpm2cpio rs ruby ruby-dev ruby-feedparser ruby-magic ruby-mocha ruby-ronn rygel-playbin rygel-tracker s-tui sanoid saytime scrcpy scrcpy-server screenfetch scrot sdate sddm seahorse shim-signed sigil smartmontools smem smplayer sng sound-juicer sound-theme-freedesktop spectre-meltdown-checker sq ssh-audit sshuttle stress-ng strongswan strongswan-swanctl syncthing system-config-printer system-config-printer-common system-config-printer-udev systemd-bootchart systemd-container tardiff task-desktop task-english task-ssh-server tasksel tellico texinfo texlive-fonts-extra texlive-lang-cyrillic texlive-lang-french texlive-lang-german texlive-lang-italian texlive-xetex tftp-hpa thunar-archive-plugin tidy tikzit tint2 tintin++ tipa tpm2-tools traceroute tree trocla ucf udisks2 unifont unrar-free upower usbguard uuid-runtime vagrant-cachier vagrant-libvirt virt-manager vmtouch vorbis-tools w3m wamerican wamerican-huge wfrench whipper whohas wireshark xapian-tools xclip xdg-user-dirs-gtk xlax xmlto xsensors xserver-xorg xsltproc xxd xz-utils yubioath-desktop zathura zathura-pdf-poppler zenity zfs-dkms zfs-initramfs zfsutils-linux zip zlib1g zlib1g-dev
157 old: apparmor-utils clustershell davfs2 dconf-cli dconf-editor ddccontrol ddrescueview decopy dnsdiag ebtables efibootmgr elpa-lua-mode entr figlet file-roller fio flac flex font-manager freetype2-demos ftp gallery-dl gcc-arm-linux-gnueabihf gcolor3 gcp gdu gedit git-debrebase gnote golang-docker-credential-helpers golang-golang-x-tools gtypist hackrf hashcat html2text httpie httping hugo humanfriendly iamerican-huge ibus ibus-pinyin imediff input-utils internetarchive ipmitool iptraf-ng jackd2 jupyter-qtconsole k3b kbtin kdialog keditbookmarks keepassxc kexec-tools kfind konsole leiningen lightdm lynx lz4json magic-wormhole manuskript mat2 mate-notification-daemon mktorrent mp3splt msitools mtp-tools mtree-netbsd nautilus nautilus-sendto nd ndisc6 neomutt net-tools nethogs nghttp2-client nocache ntpdate nwipe obs-studio openstack-pkg-tools paprefs pass-extension-audit pcmanfm pdf-presenter-console pdf2svg percol pipenv playerctl qjackctl qpdfview qrencode r-cran-ggplot2 r-cran-reshape2 rake restic rhash rpl rpm2cpio rs ruby-feedparser ruby-magic ruby-mocha ruby-ronn s-tui saytime scrcpy screenfetch scrot sdate seahorse shim-signed sigil smem smplayer sng sound-juicer spectre-meltdown-checker sq ssh-audit sshuttle stress-ng system-config-printer system-config-printer-common tardiff tasksel tellico texlive-lang-cyrillic texlive-lang-french tftp-hpa tikzit tint2 tintin++ tpm2-tools traceroute tree unrar-free vagrant-cachier vagrant-libvirt vmtouch vorbis-tools w3m wamerican wamerican-huge wfrench whipper whohas xdg-user-dirs-gtk xlax xmlto xsensors xxd yubioath-desktop zenity zip
131 new: beignet-opencl-icd bridge-utils cups-pk-helper dconf-gsettings-backend debmake debootstrap dict-devil dict-freedict-eng-fra dict-freedict-eng-spa dict-freedict-fra-eng dict-freedict-spa-eng diffoscope dropbear-initramfs eog evince file fonts-cantarell fonts-inconsolata fonts-ipafont-gothic fonts-ipafont-mincho fonts-liberation fonts-monoid fonts-monoid-tight fonts-noto fonts-powerline fonts-symbola freeipmi fwupd-amd64-signed gdisk gdm3 gedit-plugins gettext-base gnome-boxes gnupg2 golang-any grub-efi-amd64-signed gsettings-desktop-schemas gsfonts gstreamer1.0-libav gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-plugins-ugly gstreamer1.0-pulseaudio gvfs-backends ibus-gtk3 ibus-libpinyin im-config img2pdf imv initramfs-tools installation-birthday iptables jupyter jupyter-nbextension-jupyter-js-widgets keyboard-configuration krb5-locales kwin-x11 lintian linux-image-amd64 linux-perf lmodern lsb-base lvm2 mailscripts mailutils mate-themes mime-support mpdris2 mupdf ncal npm2deb ntfs-3g nvme-cli okular-extra-backends openstack-clients plymouth plymouth-themes popularity-contest progress prometheus-node-exporter psensor pubpaste pulseaudio python3-ldap ruby ruby-dev rygel-playbin rygel-tracker sanoid scrcpy-server sddm smartmontools sound-theme-freedesktop strongswan strongswan-swanctl syncthing system-config-printer-udev systemd-bootchart systemd-container task-desktop task-english task-ssh-server texinfo texlive-fonts-extra texlive-lang-german texlive-lang-italian texlive-xetex thunar-archive-plugin tidy tipa trocla ucf udisks2 unifont upower usbguard uuid-runtime virt-manager wireshark xapian-tools xclip xserver-xorg xsltproc xz-utils zathura zathura-pdf-poppler zfs-dkms zfs-initramfs zfsutils-linux zlib1g zlib1g-dev
Yuck! That's a lot of shit to go through. Notice how the packages get sorted between "old" and "new" packages. This is because popcon is used as a tool to mark which packages are "old". If you have unmanaged packages, the "old" ones are likely things that you can uninstall, for example. If you don't have popcon installed, you'll also get this warning:
popcon stats not available: [Errno 2] No such file or directory: '/var/log/popularity-contest'
The error can otherwise be safely ignored, but you won't get "help" prioritizing the packages to add to your manifests. Note that the tool ignores packages that were "marked" (see apt-mark(8)) as automatically installed. This implies that you might have to do a little bit of cleanup the first time you run this, as Debian doesn't necessarily mark all of those packages correctly on first install. For example, here's how it looks like on a clean install, after Puppet ran:
root@angela:/home/anarcat# ./bin/puppet-package-check -v
listing puppet packages...
listing apt packages...
loading apt cache...
127 unmanaged packages found
ca-certificates console-setup cryptsetup-initramfs dbus file gcc-12-base gettext-base grub-common grub-efi-amd64 i3lock initramfs-tools iw keyboard-configuration krb5-locales laptop-detect libacl1 libapparmor1 libapt-pkg6.0 libargon2-1 libattr1 libaudit-common libaudit1 libblkid1 libbpf0 libbsd0 libbz2-1.0 libc6 libcap-ng0 libcap2 libcap2-bin libcom-err2 libcrypt1 libcryptsetup12 libdb5.3 libdebconfclient0 libdevmapper1.02.1 libedit2 libelf1 libext2fs2 libfdisk1 libffi8 libgcc-s1 libgcrypt20 libgmp10 libgnutls30 libgpg-error0 libgssapi-krb5-2 libhogweed6 libidn2-0 libip4tc2 libiw30 libjansson4 libjson-c5 libk5crypto3 libkeyutils1 libkmod2 libkrb5-3 libkrb5support0 liblocale-gettext-perl liblockfile-bin liblz4-1 liblzma5 libmd0 libmnl0 libmount1 libncurses6 libncursesw6 libnettle8 libnewt0.52 libnftables1 libnftnl11 libnl-3-200 libnl-genl-3-200 libnl-route-3-200 libnss-systemd libp11-kit0 libpam-systemd libpam0g libpcre2-8-0 libpcre3 libpcsclite1 libpopt0 libprocps8 libreadline8 libselinux1 libsemanage-common libsemanage2 libsepol2 libslang2 libsmartcols1 libss2 libssl1.1 libssl3 libstdc++6 libsystemd-shared libsystemd0 libtasn1-6 libtext-charwidth-perl libtext-iconv-perl libtext-wrapi18n-perl libtinfo6 libtirpc-common libtirpc3 libudev1 libunistring2 libuuid1 libxtables12 libxxhash0 libzstd1 linux-image-amd64 logsave lsb-base lvm2 media-types mlocate ncurses-term pass-extension-otp puppet python3-reportbug shim-signed tasksel ucf usr-is-merged util-linux-extra wpasupplicant xorg zlib1g
popcon stats not available: [Errno 2] No such file or directory: '/var/log/popularity-contest'
Normally, there should be unmanaged packages here. But because of the way Debian is installed, a lot of libraries and some core packages are marked as manually installed, and are of course not managed through Puppet. There are two solutions to this problem:
  • really manage everything in Puppet (argh)
  • mark packages as automatically installed
I typically chose the second path and mark a ton of stuff as automatic. Then either they will be auto-removed, or will stop being listed. In the above scenario, one could mark all libraries as automatically installed with:
apt-mark auto $(./bin/puppet-package-check   grep -o 'lib[^ ]*')
... but if you trust that most of that stuff is actually garbage that you don't really want installed anyways, you could just mark it all as automatically installed:
apt-mark auto $(./bin/puppet-package-check)
In my case, that ended up keeping basically all libraries (because of course they're installed for some reason) and auto-removing this:
dh-dkms discover-data dkms libdiscover2 libjsoncpp25 libssl1.1 linux-headers-amd64 mlocate pass-extension-otp pass-otp plocate x11-apps x11-session-utils xinit xorg
You'll notice xorg in there: yep, that's bad. Not what I wanted. But for some reason, on other workstations, I did not actually have xorg installed. Turns out having xserver-xorg is enough, and that one has dependencies. So now I guess I just learned to stop worrying and live without X(org).

Optimizing large package installs But that, of course, is not all. Why make things simple when you can have an unreadable title that is trying to be both syntactically correct and click-baity enough to flatter my vain ego? Right. One of the challenges in bootstrapping Puppet with large package lists is that it's slow. Puppet lists packages as individual resources and will basically run apt install $PKG on every package in the manifest, one at a time. While the overhead of apt is generally small, when you add things like apt-listbugs, apt-listchanges, needrestart, triggers and so on, it can take forever setting up a new host. So for initial installs, it can actually makes sense to skip the queue and just install everything in one big batch. And because the above tool inspects the packages installed by Puppet, you can run it against a catalog and have a full lists of all the packages Puppet would install, even before I even had Puppet running. So when reinstalling my laptop, I basically did this:
apt install puppet-agent/experimental
puppet agent --test --noop
apt install $(./puppet-package-check --debug \
    2>&1   grep ^puppet\ packages 
      sed 's/puppet packages://;s/ /\n/g'
      grep -v -e onionshare -e golint -e git-sizer -e github-backup -e hledger -e xsane -e audacity -e chirp -e elpa-flycheck -e elpa-lsp-ui -e yubikey-manager -e git-annex -e hopenpgp-tools -e puppet
) puppet-agent/experimental
That massive grep was because there are currently a lot of packages missing from bookworm. Those are all packages that I have in my catalog but that still haven't made it to bookworm. Sad, I know. I eventually worked around that by adding bullseye sources so that the Puppet manifest actually ran. The point here is that this improves the Puppet run time a lot. All packages get installed at once, and you get a nice progress bar. Then you actually run Puppet to deploy configurations and all the other goodies:
puppet agent --test
I wish I could tell you how much faster that ran. I don't know, and I will not go through a full reinstall just to please your curiosity. The only hard number I have is that it installed 444 packages (which exploded in 10,191 packages with dependencies) in a mere 10 minutes. That might also be with the packages already downloaded. In any case, I have that gut feeling it's faster, so you'll have to just trust my gut. It is, after all, much more important than you might think.

Similar work The blueprint system is something similar to this:
It figures out what you ve done manually, stores it locally in a Git repository, generates code that s able to recreate your efforts, and helps you deploy those changes to production
That tool has unfortunately been abandoned for a decade at this point. Also note that the AutoRemove::RecommendsImportant and AutoRemove::SuggestsImportant are relevant here. If it is set to true (the default), a package will not be removed if it is (respectively) a Recommends or Suggests of another package (as opposed to the normal Depends). In other words, if you want to also auto-remove packages that are only Suggests, you would, for example, add this to apt.conf:
AutoRemove::SuggestsImportant false;
Paul Wise has tried to make the Debian installer and debootstrap properly mark packages as automatically installed in the past, but his bug reports were rejected. The other suggestions in this section are also from Paul, thanks!

25 September 2022

Sergio Talens-Oliag: Kubernetes Static Content Server

This post describes how I ve put together a simple static content server for kubernetes clusters using a Pod with a persistent volume and multiple containers: an sftp server to manage contents, a web server to publish them with optional access control and another one to run scripts which need access to the volume filesystem. The sftp server runs using MySecureShell, the web server is nginx and the script runner uses the webhook tool to publish endpoints to call them (the calls will come from other Pods that run backend servers or are executed from Jobs or CronJobs).

HistoryThe system was developed because we had a NodeJS API with endpoints to upload files and store them on S3 compatible services that were later accessed via HTTPS, but the requirements changed and we needed to be able to publish folders instead of individual files using their original names and apply access restrictions using our API. Thinking about our requirements the use of a regular filesystem to keep the files and folders was a good option, as uploading and serving files is simple. For the upload I decided to use the sftp protocol, mainly because I already had an sftp container image based on mysecureshell prepared; once we settled on that we added sftp support to the API server and configured it to upload the files to our server instead of using S3 buckets. To publish the files we added a nginx container configured to work as a reverse proxy that uses the ngx_http_auth_request_module to validate access to the files (the sub request is configurable, in our deployment we have configured it to call our API to check if the user can access a given URL). Finally we added a third container when we needed to execute some tasks directly on the filesystem (using kubectl exec with the existing containers did not seem a good idea, as that is not supported by CronJobs objects, for example). The solution we found avoiding the NIH Syndrome (i.e. write our own tool) was to use the webhook tool to provide the endpoints to call the scripts; for now we have three:
  • one to get the disc usage of a PATH,
  • one to hardlink all the files that are identical on the filesystem,
  • one to copy files and folders from S3 buckets to our filesystem.

Container definitions

mysecureshellThe mysecureshell container can be used to provide an sftp service with multiple users (although the files are owned by the same UID and GID) using standalone containers (launched with docker or podman) or in an orchestration system like kubernetes, as we are going to do here. The image is generated using the following Dockerfile:
ARG ALPINE_VERSION=3.16.2
FROM alpine:$ALPINE_VERSION as builder
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
RUN apk update &&\
 apk add --no-cache alpine-sdk git musl-dev &&\
 git clone https://github.com/sto/mysecureshell.git &&\
 cd mysecureshell &&\
 ./configure --prefix=/usr --sysconfdir=/etc --mandir=/usr/share/man\
 --localstatedir=/var --with-shutfile=/var/lib/misc/sftp.shut --with-debug=2 &&\
 make all && make install &&\
 rm -rf /var/cache/apk/*
FROM alpine:$ALPINE_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
COPY --from=builder /usr/bin/mysecureshell /usr/bin/mysecureshell
COPY --from=builder /usr/bin/sftp-* /usr/bin/
RUN apk update &&\
 apk add --no-cache openssh shadow pwgen &&\
 sed -i -e "s ^.*\(AuthorizedKeysFile\).*$ \1 /etc/ssh/auth_keys/%u "\
 /etc/ssh/sshd_config &&\
 mkdir /etc/ssh/auth_keys &&\
 cat /dev/null > /etc/motd &&\
 add-shell '/usr/bin/mysecureshell' &&\
 rm -rf /var/cache/apk/*
COPY bin/* /usr/local/bin/
COPY etc/sftp_config /etc/ssh/
COPY entrypoint.sh /
EXPOSE 22
VOLUME /sftp
ENTRYPOINT ["/entrypoint.sh"]
CMD ["server"]
The /etc/sftp_config file is used to configure the mysecureshell server to have all the user homes under /sftp/data, only allow them to see the files under their home directories as if it were at the root of the server and close idle connections after 5m of inactivity:
etc/sftp_config
# Default mysecureshell configuration
<Default>
   # All users will have access their home directory under /sftp/data
   Home /sftp/data/$USER
   # Log to a file inside /sftp/logs/ (only works when the directory exists)
   LogFile /sftp/logs/mysecureshell.log
   # Force users to stay in their home directory
   StayAtHome true
   # Hide Home PATH, it will be shown as /
   VirtualChroot true
   # Hide real file/directory owner (just change displayed permissions)
   DirFakeUser true
   # Hide real file/directory group (just change displayed permissions)
   DirFakeGroup true
   # We do not want users to keep forever their idle connection
   IdleTimeOut 5m
</Default>
# vim: ts=2:sw=2:et
The entrypoint.sh script is the one responsible to prepare the container for the users included on the /secrets/user_pass.txt file (creates the users with their HOME directories under /sftp/data and a /bin/false shell and creates the key files from /secrets/user_keys.txt if available). The script expects a couple of environment variables:
  • SFTP_UID: UID used to run the daemon and for all the files, it has to be different than 0 (all the files managed by this daemon are going to be owned by the same user and group, even if the remote users are different).
  • SFTP_GID: GID used to run the daemon and for all the files, it has to be different than 0.
And can use the SSH_PORT and SSH_PARAMS values if present. It also requires the following files (they can be mounted as secrets in kubernetes):
  • /secrets/host_keys.txt: Text file containing the ssh server keys in mime format; the file is processed using the reformime utility (the one included on busybox) and can be generated using the gen-host-keys script included on the container (it uses ssh-keygen and makemime).
  • /secrets/user_pass.txt: Text file containing lines of the form username:password_in_clear_text (only the users included on this file are available on the sftp server, in fact in our deployment we use only the scs user for everything).
And optionally can use another one:
  • /secrets/user_keys.txt: Text file that contains lines of the form username:public_ssh_ed25519_or_rsa_key; the public keys are installed on the server and can be used to log into the sftp server if the username exists on the user_pass.txt file.
The contents of the entrypoint.sh script are:
entrypoint.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
# Expects SSH_UID & SSH_GID on the environment and uses the value of the
# SSH_PORT & SSH_PARAMS variables if present
# SSH_PARAMS
SSH_PARAMS="-D -e -p $ SSH_PORT:=22  $ SSH_PARAMS "
# Fixed values
# DIRECTORIES
HOME_DIR="/sftp/data"
CONF_FILES_DIR="/secrets"
AUTH_KEYS_PATH="/etc/ssh/auth_keys"
# FILES
HOST_KEYS="$CONF_FILES_DIR/host_keys.txt"
USER_KEYS="$CONF_FILES_DIR/user_keys.txt"
USER_PASS="$CONF_FILES_DIR/user_pass.txt"
USER_SHELL_CMD="/usr/bin/mysecureshell"
# TYPES
HOST_KEY_TYPES="dsa ecdsa ed25519 rsa"
# ---------
# FUNCTIONS
# ---------
# Validate HOST_KEYS, USER_PASS, SFTP_UID and SFTP_GID
_check_environment()  
  # Check the ssh server keys ... we don't boot if we don't have them
  if [ ! -f "$HOST_KEYS" ]; then
    cat <<EOF
We need the host keys on the '$HOST_KEYS' file to proceed.
Call the 'gen-host-keys' script to create and export them on a mime file.
EOF
    exit 1
  fi
  # Check that we have users ... if we don't we can't continue
  if [ ! -f "$USER_PASS" ]; then
    cat <<EOF
We need at least the '$USER_PASS' file to provision users.
Call the 'gen-users-tar' script to create a tar file to create an archive that
contains public and private keys for users, a 'user_keys.txt' with the public
keys of the users and a 'user_pass.txt' file with random passwords for them 
(pass the list of usernames to it).
EOF
    exit 1
  fi
  # Check SFTP_UID
  if [ -z "$SFTP_UID" ]; then
    echo "The 'SFTP_UID' can't be empty, pass a 'GID'."
    exit 1
  fi
  if [ "$SFTP_UID" -eq "0" ]; then
    echo "The 'SFTP_UID' can't be 0, use a different 'UID'"
    exit 1
  fi
  # Check SFTP_GID
  if [ -z "$SFTP_GID" ]; then
    echo "The 'SFTP_GID' can't be empty, pass a 'GID'."
    exit 1
  fi
  if [ "$SFTP_GID" -eq "0" ]; then
    echo "The 'SFTP_GID' can't be 0, use a different 'GID'"
    exit 1
  fi
 
# Adjust ssh host keys
_setup_host_keys()  
  opwd="$(pwd)"
  tmpdir="$(mktemp -d)"
  cd "$tmpdir"
  ret="0"
  reformime <"$HOST_KEYS"   ret="1"
  for kt in $HOST_KEY_TYPES; do
    key="ssh_host_$ kt _key"
    pub="ssh_host_$ kt _key.pub"
    if [ ! -f "$key" ]; then
      echo "Missing '$key' file"
      ret="1"
    fi
    if [ ! -f "$pub" ]; then
      echo "Missing '$pub' file"
      ret="1"
    fi
    if [ "$ret" -ne "0" ]; then
      continue
    fi
    cat "$key" >"/etc/ssh/$key"
    chmod 0600 "/etc/ssh/$key"
    chown root:root "/etc/ssh/$key"
    cat "$pub" >"/etc/ssh/$pub"
    chmod 0600 "/etc/ssh/$pub"
    chown root:root "/etc/ssh/$pub"
  done
  cd "$opwd"
  rm -rf "$tmpdir"
  return "$ret"
 
# Create users
_setup_user_pass()  
  opwd="$(pwd)"
  tmpdir="$(mktemp -d)"
  cd "$tmpdir"
  ret="0"
  [ -d "$HOME_DIR" ]   mkdir "$HOME_DIR"
  # Make sure the data dir can be managed by the sftp user
  chown "$SFTP_UID:$SFTP_GID" "$HOME_DIR"
  # Allow the user (and root) to create directories inside the $HOME_DIR, if
  # we don't allow it the directory creation fails on EFS (AWS)
  chmod 0755 "$HOME_DIR"
  # Create users
  echo "sftp:sftp:$SFTP_UID:$SFTP_GID:::/bin/false" >"newusers.txt"
  sed -n "/^[^#]/   s/:/ /p  " "$USER_PASS"   while read -r _u _p; do
    echo "$_u:$_p:$SFTP_UID:$SFTP_GID::$HOME_DIR/$_u:$USER_SHELL_CMD"
  done >>"newusers.txt"
  newusers --badnames newusers.txt
  # Disable write permission on the directory to forbid remote sftp users to
  # remove their own root dir (they have already done it); we adjust that
  # here to avoid issues with EFS (see before)
  chmod 0555 "$HOME_DIR"
  # Clean up the tmpdir
  cd "$opwd"
  rm -rf "$tmpdir"
  return "$ret"
 
# Adjust user keys
_setup_user_keys()  
  if [ -f "$USER_KEYS" ]; then
    sed -n "/^[^#]/   s/:/ /p  " "$USER_KEYS"   while read -r _u _k; do
      echo "$_k" >>"$AUTH_KEYS_PATH/$_u"
    done
  fi
 
# Main function
exec_sshd()  
  _check_environment
  _setup_host_keys
  _setup_user_pass
  _setup_user_keys
  echo "Running: /usr/sbin/sshd $SSH_PARAMS"
  # shellcheck disable=SC2086
  exec /usr/sbin/sshd -D $SSH_PARAMS
 
# ----
# MAIN
# ----
case "$1" in
"server") exec_sshd ;;
*) exec "$@" ;;
esac
# vim: ts=2:sw=2:et
The container also includes a couple of auxiliary scripts, the first one can be used to generate the host_keys.txt file as follows:
$ docker run --rm stodh/mysecureshell gen-host-keys > host_keys.txt
Where the script is as simple as:
bin/gen-host-keys
#!/bin/sh
set -e
# Generate new host keys
ssh-keygen -A >/dev/null
# Replace hostname
sed -i -e 's/@.*$/@mysecureshell/' /etc/ssh/ssh_host_*_key.pub
# Print in mime format (stdout)
makemime /etc/ssh/ssh_host_*
# vim: ts=2:sw=2:et
And there is another script to generate a .tar file that contains auth data for the list of usernames passed to it (the file contains a user_pass.txt file with random passwords for the users, public and private ssh keys for them and the user_keys.txt file that matches the generated keys). To generate a tar file for the user scs we can execute the following:
$ docker run --rm stodh/mysecureshell gen-users-tar scs > /tmp/scs-users.tar
To see the contents and the text inside the user_pass.txt file we can do:
$ tar tvf /tmp/scs-users.tar
-rw-r--r-- root/root        21 2022-09-11 15:55 user_pass.txt
-rw-r--r-- root/root       822 2022-09-11 15:55 user_keys.txt
-rw------- root/root       387 2022-09-11 15:55 id_ed25519-scs
-rw-r--r-- root/root        85 2022-09-11 15:55 id_ed25519-scs.pub
-rw------- root/root      3357 2022-09-11 15:55 id_rsa-scs
-rw------- root/root      3243 2022-09-11 15:55 id_rsa-scs.pem
-rw-r--r-- root/root       729 2022-09-11 15:55 id_rsa-scs.pub
$ tar xfO /tmp/scs-users.tar user_pass.txt
scs:20JertRSX2Eaar4x
The source of the script is:
bin/gen-users-tar
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
USER_KEYS_FILE="user_keys.txt"
USER_PASS_FILE="user_pass.txt"
# ---------
# MAIN CODE
# ---------
# Generate user passwords and keys, return 1 if no username is received
if [ "$#" -eq "0" ]; then
  return 1
fi
opwd="$(pwd)"
tmpdir="$(mktemp -d)"
cd "$tmpdir"
for u in "$@"; do
  ssh-keygen -q -a 100 -t ed25519 -f "id_ed25519-$u" -C "$u" -N ""
  ssh-keygen -q -a 100 -b 4096 -t rsa -f "id_rsa-$u" -C "$u" -N ""
  # Legacy RSA private key format
  cp -a "id_rsa-$u" "id_rsa-$u.pem"
  ssh-keygen -q -p -m pem -f "id_rsa-$u.pem" -N "" -P "" >/dev/null
  chmod 0600 "id_rsa-$u.pem"
  echo "$u:$(pwgen -s 16 1)" >>"$USER_PASS_FILE"
  echo "$u:$(cat "id_ed25519-$u.pub")" >>"$USER_KEYS_FILE"
  echo "$u:$(cat "id_rsa-$u.pub")" >>"$USER_KEYS_FILE"
done
tar cf - "$USER_PASS_FILE" "$USER_KEYS_FILE" id_* 2>/dev/null
cd "$opwd"
rm -rf "$tmpdir"
# vim: ts=2:sw=2:et

nginx-scsThe nginx-scs container is generated using the following Dockerfile:
ARG NGINX_VERSION=1.23.1
FROM nginx:$NGINX_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
RUN rm -f /docker-entrypoint.d/*
COPY docker-entrypoint.d/* /docker-entrypoint.d/
Basically we are removing the existing docker-entrypoint.d scripts from the standard image and adding a new one that configures the web server as we want using a couple of environment variables:
  • AUTH_REQUEST_URI: URL to use for the auth_request, if the variable is not found on the environment auth_request is not used.
  • HTML_ROOT: Base directory of the web server, if not passed the default /usr/share/nginx/html is used.
Note that if we don t pass the variables everything works as if we were using the original nginx image. The contents of the configuration script are:
docker-entrypoint.d/10-update-default-conf.sh
#!/bin/sh
# Replace the default.conf nginx file by our own version.
set -e
if [ -z "$HTML_ROOT" ]; then
  HTML_ROOT="/usr/share/nginx/html"
fi
if [ "$AUTH_REQUEST_URI" ]; then
  cat >/etc/nginx/conf.d/default.conf <<EOF
server  
  listen       80;
  server_name  localhost;
  location /  
    auth_request /.auth;
    root  $HTML_ROOT;
    index index.html index.htm;
   
  location /.auth  
    internal;
    proxy_pass $AUTH_REQUEST_URI;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI \$request_uri;
   
  error_page   500 502 503 504  /50x.html;
  location = /50x.html  
    root /usr/share/nginx/html;
   
 
EOF
else
  cat >/etc/nginx/conf.d/default.conf <<EOF
server  
  listen       80;
  server_name  localhost;
  location /  
    root  $HTML_ROOT;
    index index.html index.htm;
   
  error_page   500 502 503 504  /50x.html;
  location = /50x.html  
    root /usr/share/nginx/html;
   
 
EOF
fi
# vim: ts=2:sw=2:et
As we will see later the idea is to use the /sftp/data or /sftp/data/scs folder as the root of the web published by this container and create an Ingress object to provide access to it outside of our kubernetes cluster.

webhook-scsThe webhook-scs container is generated using the following Dockerfile:
ARG ALPINE_VERSION=3.16.2
ARG GOLANG_VERSION=alpine3.16
FROM golang:$GOLANG_VERSION AS builder
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
ENV WEBHOOK_VERSION 2.8.0
ENV WEBHOOK_PR 549
ENV S3FS_VERSION v1.91
WORKDIR /go/src/github.com/adnanh/webhook
RUN apk update &&\
 apk add --no-cache -t build-deps curl libc-dev gcc libgcc patch
RUN curl -L --silent -o webhook.tar.gz\
 https://github.com/adnanh/webhook/archive/$ WEBHOOK_VERSION .tar.gz &&\
 tar xzf webhook.tar.gz --strip 1 &&\
 curl -L --silent -o $ WEBHOOK_PR .patch\
 https://patch-diff.githubusercontent.com/raw/adnanh/webhook/pull/$ WEBHOOK_PR .patch &&\
 patch -p1 < $ WEBHOOK_PR .patch &&\
 go get -d && \
 go build -o /usr/local/bin/webhook
WORKDIR /src/s3fs-fuse
RUN apk update &&\
 apk add ca-certificates build-base alpine-sdk libcurl automake autoconf\
 libxml2-dev libressl-dev mailcap fuse-dev curl-dev
RUN curl -L --silent -o s3fs.tar.gz\
 https://github.com/s3fs-fuse/s3fs-fuse/archive/refs/tags/$S3FS_VERSION.tar.gz &&\
 tar xzf s3fs.tar.gz --strip 1 &&\
 ./autogen.sh &&\
 ./configure --prefix=/usr/local &&\
 make -j && \
 make install
FROM alpine:$ALPINE_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
WORKDIR /webhook
RUN apk update &&\
 apk add --no-cache ca-certificates mailcap fuse libxml2 libcurl libgcc\
 libstdc++ rsync util-linux-misc &&\
 rm -rf /var/cache/apk/*
COPY --from=builder /usr/local/bin/webhook /usr/local/bin/webhook
COPY --from=builder /usr/local/bin/s3fs /usr/local/bin/s3fs
COPY entrypoint.sh /
COPY hooks/* ./hooks/
EXPOSE 9000
ENTRYPOINT ["/entrypoint.sh"]
CMD ["server"]
Again, we use a multi-stage build because in production we wanted to support a functionality that is not already on the official versions (streaming the command output as a response instead of waiting until the execution ends); this time we build the image applying the PATCH included on this pull request against a released version of the source instead of creating a fork. The entrypoint.sh script is used to generate the webhook configuration file for the existing hooks using environment variables (basically the WEBHOOK_WORKDIR and the *_TOKEN variables) and launch the webhook service:
entrypoint.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
WEBHOOK_BIN="$ WEBHOOK_BIN:-/webhook/hooks "
WEBHOOK_YML="$ WEBHOOK_YML:-/webhook/scs.yml "
WEBHOOK_OPTS="$ WEBHOOK_OPTS:--verbose "
# ---------
# FUNCTIONS
# ---------
print_du_yml()  
  cat <<EOF
- id: du
  execute-command: '$WEBHOOK_BIN/du.sh'
  command-working-directory: '$WORKDIR'
  response-headers:
  - name: 'Content-Type'
    value: 'application/json'
  http-methods: ['GET']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
  pass-arguments-to-command:
  - source: 'url'
    name: 'path'
  pass-environment-to-command:
  - source: 'string'
    envname: 'OUTPUT_FORMAT'
    name: 'json'
EOF
 
print_hardlink_yml()  
  cat <<EOF
- id: hardlink
  execute-command: '$WEBHOOK_BIN/hardlink.sh'
  command-working-directory: '$WORKDIR'
  http-methods: ['GET']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
EOF
 
print_s3sync_yml()  
  cat <<EOF
- id: s3sync
  execute-command: '$WEBHOOK_BIN/s3sync.sh'
  command-working-directory: '$WORKDIR'
  http-methods: ['POST']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
  pass-environment-to-command:
  - source: 'payload'
    envname: 'AWS_KEY'
    name: 'aws.key'
  - source: 'payload'
    envname: 'AWS_SECRET_KEY'
    name: 'aws.secret_key'
  - source: 'payload'
    envname: 'S3_BUCKET'
    name: 's3.bucket'
  - source: 'payload'
    envname: 'S3_REGION'
    name: 's3.region'
  - source: 'payload'
    envname: 'S3_PATH'
    name: 's3.path'
  - source: 'payload'
    envname: 'SCS_PATH'
    name: 'scs.path'
  stream-command-output: true
EOF
 
print_token_yml()  
  if [ "$1" ]; then
    cat << EOF
  trigger-rule:
    match:
      type: 'value'
      value: '$1'
      parameter:
        source: 'header'
        name: 'X-Webhook-Token'
EOF
  fi
 
exec_webhook()  
  # Validate WORKDIR
  if [ -z "$WEBHOOK_WORKDIR" ]; then
    echo "Must define the WEBHOOK_WORKDIR variable!" >&2
    exit 1
  fi
  WORKDIR="$(realpath "$WEBHOOK_WORKDIR" 2>/dev/null)"   true
  if [ ! -d "$WORKDIR" ]; then
    echo "The WEBHOOK_WORKDIR '$WEBHOOK_WORKDIR' is not a directory!" >&2
    exit 1
  fi
  # Get TOKENS, if the DU_TOKEN or HARDLINK_TOKEN is defined that is used, if
  # not if the COMMON_TOKEN that is used and in other case no token is checked
  # (that is the default)
  DU_TOKEN="$ DU_TOKEN:-$COMMON_TOKEN "
  HARDLINK_TOKEN="$ HARDLINK_TOKEN:-$COMMON_TOKEN "
  S3_TOKEN="$ S3_TOKEN:-$COMMON_TOKEN "
  # Create webhook configuration
    
    print_du_yml
    print_token_yml "$DU_TOKEN"
    echo ""
    print_hardlink_yml
    print_token_yml "$HARDLINK_TOKEN"
    echo ""
    print_s3sync_yml
    print_token_yml "$S3_TOKEN"
   >"$WEBHOOK_YML"
  # Run the webhook command
  # shellcheck disable=SC2086
  exec webhook -hooks "$WEBHOOK_YML" $WEBHOOK_OPTS
 
# ----
# MAIN
# ----
case "$1" in
"server") exec_webhook ;;
*) exec "$@" ;;
esac
The entrypoint.sh script generates the configuration file for the webhook server calling functions that print a yaml section for each hook and optionally adds rules to validate access to them comparing the value of a X-Webhook-Token header against predefined values. The expected token values are taken from environment variables, we can define a token variable for each hook (DU_TOKEN, HARDLINK_TOKEN or S3_TOKEN) and a fallback value (COMMON_TOKEN); if no token variable is defined for a hook no check is done and everybody can call it. The Hook Definition documentation explains the options you can use for each hook, the ones we have right now do the following:
  • du: runs on the $WORKDIR directory, passes as first argument to the script the value of the path query parameter and sets the variable OUTPUT_FORMAT to the fixed value json (we use that to print the output of the script in JSON format instead of text).
  • hardlink: runs on the $WORKDIR directory and takes no parameters.
  • s3sync: runs on the $WORKDIR directory and sets a lot of environment variables from values read from the JSON encoded payload sent by the caller (all the values must be sent by the caller even if they are assigned an empty value, if they are missing the hook fails without calling the script); we also set the stream-command-output value to true to make the script show its output as it is working (we patched the webhook source to be able to use this option).

The du hook scriptThe du hook script code checks if the argument passed is a directory, computes its size using the du command and prints the results in text format or as a JSON dictionary:
hooks/du.sh
#!/bin/sh
set -e
# Script to print disk usage for a PATH inside the scs folder
# ---------
# FUNCTIONS
# ---------
print_error()  
  if [ "$OUTPUT_FORMAT" = "json" ]; then
    echo " \"error\":\"$*\" "
  else
    echo "$*" >&2
  fi
  exit 1
 
usage()  
  if [ "$OUTPUT_FORMAT" = "json" ]; then
    echo " \"error\":\"Pass arguments as '?path=XXX\" "
  else
    echo "Usage: $(basename "$0") PATH" >&2
  fi
  exit 1
 
# ----
# MAIN
# ----
if [ "$#" -eq "0" ]   [ -z "$1" ]; then
  usage
fi
if [ "$1" = "." ]; then
  DU_PATH="./"
else
  DU_PATH="$(find . -name "$1" -mindepth 1 -maxdepth 1)"   true
fi
if [ -z "$DU_PATH" ]   [ ! -d "$DU_PATH/." ]; then
  print_error "The provided PATH ('$1') is not a directory"
fi
# Print disk usage in bytes for the given PATH
OUTPUT="$(du -b -s "$DU_PATH")"
if [ "$OUTPUT_FORMAT" = "json" ]; then
  # Format output as  "path":"PATH","bytes":"BYTES" 
  echo "$OUTPUT"  
    sed -e "s%^\(.*\)\t.*/\(.*\)$% \"path\":\"\2\",\"bytes\":\"\1\" %"  
    tr -d '\n'
else
  # Print du output as is
  echo "$OUTPUT"
fi
# vim: ts=2:sw=2:et:ai:sts=2

The s3sync hook scriptThe s3sync hook script uses the s3fs tool to mount a bucket and synchronise data between a folder inside the bucket and a directory on the filesystem using rsync; all values needed to execute the task are taken from environment variables:
hooks/s3sync.sh
#!/bin/ash
set -euo pipefail
set -o errexit
set -o errtrace
# Functions
finish()  
  ret="$1"
  echo ""
  echo "Script exit code: $ret"
  exit "$ret"
 
# Check variables
if [ -z "$AWS_KEY" ]   [ -z "$AWS_SECRET_KEY" ]   [ -z "$S3_BUCKET" ]  
  [ -z "$S3_PATH" ]   [ -z "$SCS_PATH" ]; then
  [ "$AWS_KEY" ]   echo "Set the AWS_KEY environment variable"
  [ "$AWS_SECRET_KEY" ]   echo "Set the AWS_SECRET_KEY environment variable"
  [ "$S3_BUCKET" ]   echo "Set the S3_BUCKET environment variable"
  [ "$S3_PATH" ]   echo "Set the S3_PATH environment variable"
  [ "$SCS_PATH" ]   echo "Set the SCS_PATH environment variable"
  finish 1
fi
if [ "$S3_REGION" ] && [ "$S3_REGION" != "us-east-1" ]; then
  EP_URL="endpoint=$S3_REGION,url=https://s3.$S3_REGION.amazonaws.com"
else
  EP_URL="endpoint=us-east-1"
fi
# Prepare working directory
WORK_DIR="$(mktemp -p "$HOME" -d)"
MNT_POINT="$WORK_DIR/s3data"
PASSWD_S3FS="$WORK_DIR/.passwd-s3fs"
# Check the moutpoint
if [ ! -d "$MNT_POINT" ]; then
  mkdir -p "$MNT_POINT"
elif mountpoint "$MNT_POINT"; then
  echo "There is already something mounted on '$MNT_POINT', aborting!"
  finish 1
fi
# Create password file
touch "$PASSWD_S3FS"
chmod 0400 "$PASSWD_S3FS"
echo "$AWS_KEY:$AWS_SECRET_KEY" >"$PASSWD_S3FS"
# Mount s3 bucket as a filesystem
s3fs -o dbglevel=info,retries=5 -o "$EP_URL" -o "passwd_file=$PASSWD_S3FS" \
  "$S3_BUCKET" "$MNT_POINT"
echo "Mounted bucket '$S3_BUCKET' on '$MNT_POINT'"
# Remove the password file, just in case
rm -f "$PASSWD_S3FS"
# Check source PATH
ret="0"
SRC_PATH="$MNT_POINT/$S3_PATH"
if [ ! -d "$SRC_PATH" ]; then
  echo "The S3_PATH '$S3_PATH' can't be found!"
  ret=1
fi
# Compute SCS_UID & SCS_GID (by default based on the working directory owner)
SCS_UID="$ SCS_UID:=$(stat -c "%u" "." 2>/dev/null) "   true
SCS_GID="$ SCS_GID:=$(stat -c "%g" "." 2>/dev/null) "   true
# Check destination PATH
DST_PATH="./$SCS_PATH"
if [ "$ret" -eq "0" ] && [ -d "$DST_PATH" ]; then
  mkdir -p "$DST_PATH"   ret="$?"
fi
# Copy using rsync
if [ "$ret" -eq "0" ]; then
  rsync -rlptv --chown="$SCS_UID:$SCS_GID" --delete --stats \
    "$SRC_PATH/" "$DST_PATH/"   ret="$?"
fi
# Unmount the S3 bucket
umount -f "$MNT_POINT"
echo "Called umount for '$MNT_POINT'"
# Remove mount point dir
rmdir "$MNT_POINT"
# Remove WORK_DIR
rmdir "$WORK_DIR"
# We are done
finish "$ret"
# vim: ts=2:sw=2:et:ai:sts=2

Deployment objectsThe system is deployed as a StatefulSet with one replica. Our production deployment is done on AWS and to be able to scale we use EFS for our PersistenVolume; the idea is that the volume has no size limit, its AccessMode can be set to ReadWriteMany and we can mount it from multiple instances of the Pod without issues, even if they are in different availability zones. For development we use k3d and we are also able to scale the StatefulSet for testing because we use a ReadWriteOnce PVC, but it points to a hostPath that is backed up by a folder that is mounted on all the compute nodes, so in reality Pods in different k3d nodes use the same folder on the host.

secrets.yamlThe secrets file contains the files used by the mysecureshell container that can be generated using kubernetes pods as follows (we are only creating the scs user):
$ kubectl run "mysecureshell" --restart='Never' --quiet --rm --stdin \
  --image "stodh/mysecureshell:latest" -- gen-host-keys >"./host_keys.txt"
$ kubectl run "mysecureshell" --restart='Never' --quiet --rm --stdin \
  --image "stodh/mysecureshell:latest" -- gen-users-tar scs >"./users.tar"
Once we have the files we can generate the secrets.yaml file as follows:
$ tar xf ./users.tar user_keys.txt user_pass.txt
$ kubectl --dry-run=client -o yaml create secret generic "scs-secret" \
  --from-file="host_keys.txt=host_keys.txt" \
  --from-file="user_keys.txt=user_keys.txt" \
  --from-file="user_pass.txt=user_pass.txt" > ./secrets.yaml
The resulting secrets.yaml will look like the following file (the base64 would match the content of the files, of course):
secrets.yaml
apiVersion: v1
data:
  host_keys.txt: TWlt...
  user_keys.txt: c2Nz...
  user_pass.txt: c2Nz...
kind: Secret
metadata:
  creationTimestamp: null
  name: scs-secret

pvc.yamlThe persistent volume claim for a simple deployment (one with only one instance of the statefulSet) can be as simple as this:
pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
On this definition we don t set the storageClassName to use the default one.

Volumes in our development environment (k3d)In our development deployment we create the following PersistentVolume as required by the Local Persistence Volume Static Provisioner (note that the /volumes/scs-pv has to be created by hand, in our k3d system we mount the same host directory on the /volumes path of all the nodes and create the scs-pv directory by hand before deploying the persistent volume):
k3d-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: scs-pv
  labels:
    app.kubernetes.io/name: scs
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  claimRef:
    name: scs-pvc
  storageClassName: local-storage
  local:
    path: /volumes/scs-pv
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - k3s
And to make sure that everything works as expected we update the PVC definition to add the right storageClassName:
k3d-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: local-storage

Volumes in our production environment (aws)In the production deployment we don t create the PersistentVolume (we are using the aws-efs-csi-driver which supports Dynamic Provisioning) but we add the storageClassName (we set it to the one mapped to the EFS driver, i.e. efs-sc) and set ReadWriteMany as the accessMode:
efs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
  storageClassName: efs-sc

statefulset.yamlThe definition of the statefulSet is as follows:
statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: scs
  labels:
    app.kubernetes.io/name: scs
spec:
  serviceName: scs
  replicas: 1
  selector:
    matchLabels:
      app: scs
  template:
    metadata:
      labels:
        app: scs
    spec:
      containers:
      - name: nginx
        image: stodh/nginx-scs:latest
        ports:
        - containerPort: 80
          name: http
        env:
        - name: AUTH_REQUEST_URI
          value: ""
        - name: HTML_ROOT
          value: /sftp/data
        volumeMounts:
        - mountPath: /sftp
          name: scs-datadir
      - name: mysecureshell
        image: stodh/mysecureshell:latest
        ports:
        - containerPort: 22
          name: ssh
        securityContext:
          capabilities:
            add:
            - IPC_OWNER
        env:
        - name: SFTP_UID
          value: '2020'
        - name: SFTP_GID
          value: '2020'
        volumeMounts:
        - mountPath: /secrets
          name: scs-file-secrets
          readOnly: true
        - mountPath: /sftp
          name: scs-datadir
      - name: webhook
        image: stodh/webhook-scs:latest
        securityContext:
          privileged: true
        ports:
        - containerPort: 9000
          name: webhook-http
        env:
        - name: WEBHOOK_WORKDIR
          value: /sftp/data/scs
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - mountPath: /sftp
          name: scs-datadir
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: scs-file-secrets
        secret:
          secretName: scs-secrets
      - name: scs-datadir
        persistentVolumeClaim:
          claimName: scs-pvc
Notes about the containers:
  • nginx: As this is an example the web server is not using an AUTH_REQUEST_URI and uses the /sftp/data directory as the root of the web (to get to the files uploaded for the scs user we will need to use /scs/ as a prefix on the URLs).
  • mysecureshell: We are adding the IPC_OWNER capability to the container to be able to use some of the sftp-* commands inside it, but they are not really needed, so adding the capability is optional.
  • webhook: We are launching this container in privileged mode to be able to use the s3fs-fuse, as it will not work otherwise for now (see this kubernetes issue); if the functionality is not needed the container can be executed with regular privileges; besides, as we are not enabling public access to this service we don t define *_TOKEN variables (if required the values should be read from a Secret object).
Notes about the volumes:
  • the devfuse volume is only needed if we plan to use the s3fs command on the webhook container, if not we can remove the volume definition and its mounts.

service.yamlTo be able to access the different services on the statefulset we publish the relevant ports using the following Service object:
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scs-svc
  labels:
    app.kubernetes.io/name: scs
spec:
  ports:
  - name: ssh
    port: 22
    protocol: TCP
    targetPort: 22
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: webhook-http
    port: 9000
    protocol: TCP
    targetPort: 9000
  selector:
    app: scs

ingress.yamlTo download the scs files from the outside we can add an ingress object like the following (the definition is for testing using the localhost name):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: scs-ingress
  labels:
    app.kubernetes.io/name: scs
spec:
  ingressClassName: nginx
  rules:
  - host: 'localhost'
    http:
      paths:
      - path: /scs
        pathType: Prefix
        backend:
          service:
            name: scs-svc
            port:
              number: 80

DeploymentTo deploy the statefulSet we create a namespace and apply the object definitions shown before:
$ kubectl create namespace scs-demo
namespace/scs-demo created
$ kubectl -n scs-demo apply -f secrets.yaml
secret/scs-secrets created
$ kubectl -n scs-demo apply -f pvc.yaml
persistentvolumeclaim/scs-pvc created
$ kubectl -n scs-demo apply -f statefulset.yaml
statefulset.apps/scs created
$ kubectl -n scs-demo apply -f service.yaml
service/scs-svc created
$ kubectl -n scs-demo apply -f ingress.yaml
ingress.networking.k8s.io/scs-ingress created
Once the objects are deployed we can check that all is working using kubectl:
$ kubectl  -n scs-demo get all,secrets,ingress
NAME        READY   STATUS    RESTARTS   AGE
pod/scs-0   3/3     Running   0          24s
NAME            TYPE       CLUSTER-IP  EXTERNAL-IP  PORT(S)                  AGE
service/scs-svc ClusterIP  10.43.0.47  <none>       22/TCP,80/TCP,9000/TCP   21s

NAME                   READY   AGE
statefulset.apps/scs   1/1     24s
NAME                         TYPE                                  DATA   AGE
secret/default-token-mwcd7   kubernetes.io/service-account-token   3      53s
secret/scs-secrets           Opaque                                3      39s
NAME                                   CLASS  HOSTS      ADDRESS     PORTS   AGE
ingress.networking.k8s.io/scs-ingress  nginx  localhost  172.21.0.5  80      17s
At this point we are ready to use the system.

Usage examples

File uploadsAs previously mentioned in our system the idea is to use the sftp server from other Pods, but to test the system we are going to do a kubectl port-forward and connect to the server using our host client and the password we have generated (it is on the user_pass.txt file, inside the users.tar archive):
$ kubectl -n scs-demo port-forward service/scs-svc 2020:22 &
Forwarding from 127.0.0.1:2020 -> 22
Forwarding from [::1]:2020 -> 22
$ PF_PID=$!
$ sftp -P 2020 scs@127.0.0.1                                                 1
Handling connection for 2020
The authenticity of host '[127.0.0.1]:2020 ([127.0.0.1]:2020)' can't be \
  established.
ED25519 key fingerprint is SHA256:eHNwCnyLcSSuVXXiLKeGraw0FT/4Bb/yjfqTstt+088.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[127.0.0.1]:2020' (ED25519) to the list of known \
  hosts.
scs@127.0.0.1's password: **********
Connected to 127.0.0.1.
sftp> ls -la
drwxr-xr-x    2 sftp     sftp         4096 Sep 25 14:47 .
dr-xr-xr-x    3 sftp     sftp         4096 Sep 25 14:36 ..
sftp> !date -R > /tmp/date.txt                                               2
sftp> put /tmp/date.txt .
Uploading /tmp/date.txt to /date.txt
date.txt                                      100%   32    27.8KB/s   00:00
sftp> ls -l
-rw-r--r--    1 sftp     sftp           32 Sep 25 15:21 date.txt
sftp> ln date.txt date.txt.1                                                 3
sftp> ls -l
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt.1
sftp> put /tmp/date.txt date.txt.2                                           4
Uploading /tmp/date.txt to /date.txt.2
date.txt                                      100%   32    27.8KB/s   00:00
sftp> ls -l                                                                  5
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt.1
-rw-r--r--    1 sftp     sftp           32 Sep 25 15:21 date.txt.2
sftp> exit
$ kill "$PF_PID"
[1]  + terminated  kubectl -n scs-demo port-forward service/scs-svc 2020:22
  1. We connect to the sftp service on the forwarded port with the scs user.
  2. We put a file we have created on the host on the directory.
  3. We do a hard link of the uploaded file.
  4. We put a second copy of the file we created locally.
  5. On the file list we can see that the two first files have two hardlinks

File retrievalsIf our ingress is configured right we can download the date.txt file from the URL http://localhost/scs/date.txt:
$ curl -s http://localhost/scs/date.txt
Sun, 25 Sep 2022 17:21:51 +0200

Use of the webhook containerTo finish this post we are going to show how we can call the hooks directly, from a CronJob and from a Job.

Direct script call (du)In our deployment the direct calls are done from other Pods, to simulate it we are going to do a port-forward and call the script with an existing PATH (the root directory) and a bad one:
$ kubectl -n scs-demo port-forward service/scs-svc 9000:9000 >/dev/null &
$ PF_PID=$!
$ JSON="$(curl -s "http://localhost:9000/hooks/du?path=.")"
$ echo $JSON
 "path":"","bytes":"4160" 
$ JSON="$(curl -s "http://localhost:9000/hooks/du?path=foo")"
$ echo $JSON
 "error":"The provided PATH ('foo') is not a directory" 
$ kill $PF_PID
As we only have files on the base directory we print the disk usage of the . PATH and the output is in json format because we export OUTPUT_FORMAT with the value json on the webhook configuration.

Jobs (s3sync)The following job can be used to synchronise the contents of a directory in a S3 bucket with the SCS Filesystem:
job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3sync
  labels:
    cronjob: 's3sync'
spec:
  template:
    metadata:
      labels:
        cronjob: 's3sync'
    spec:
      containers:
      - name: s3sync-job
        image: alpine:latest
        command: 
        - "wget"
        - "-q"
        - "--header"
        - "Content-Type: application/json"
        - "--post-file"
        - "/secrets/s3sync.json"
        - "-O-"
        - "http://scs-svc:9000/hooks/s3sync"
        volumeMounts:
        - mountPath: /secrets
          name: job-secrets
          readOnly: true
      restartPolicy: Never
      volumes:
      - name: job-secrets
        secret:
          secretName: webhook-job-secrets
The file with parameters for the script must be something like this:
s3sync.json
 
  "aws":  
    "key": "********************",
    "secret_key": "****************************************"
   ,
  "s3":  
    "region": "eu-north-1",
    "bucket": "blogops-test",
    "path": "test"
   ,
  "scs":  
    "path": "test"
   
 
Once we have both files we can run the Job as follows:
$ kubectl -n scs-demo create secret generic webhook-job-secrets \            1
  --from-file="s3sync.json=s3sync.json"
secret/webhook-job-secrets created
$ kubectl -n scs-demo apply -f webhook-job.yaml                              2
job.batch/s3sync created
$ kubectl -n scs-demo get pods -l "cronjob=s3sync"                           3
NAME           READY   STATUS      RESTARTS   AGE
s3sync-zx2cj   0/1     Completed   0          12s
$ kubectl -n scs-demo logs s3sync-zx2cj                                      4
Mounted bucket 's3fs-test' on '/root/tmp.jiOjaF/s3data'
sending incremental file list
created directory ./test
./
kyso.png
Number of files: 2 (reg: 1, dir: 1)
Number of created files: 2 (reg: 1, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 15,075 bytes
Total transferred file size: 15,075 bytes
Literal data: 15,075 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.147 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 15,183
Total bytes received: 74
sent 15,183 bytes  received 74 bytes  30,514.00 bytes/sec
total size is 15,075  speedup is 0.99
Called umount for '/root/tmp.jiOjaF/s3data'
Script exit code: 0
$ kubectl -n scs-demo delete -f webhook-job.yaml                             5
job.batch "s3sync" deleted
$ kubectl -n scs-demo delete secrets webhook-job-secrets                     6
secret "webhook-job-secrets" deleted
  1. Here we create the webhook-job-secrets secret that contains the s3sync.json file.
  2. This command runs the job.
  3. Checking the label cronjob=s3sync we get the Pods executed by the job.
  4. Here we print the logs of the completed job.
  5. Once we are finished we remove the Job.
  6. And also the secret.

Final remarksThis post has been longer than I expected, but I believe it can be useful for someone; in any case, next time I ll try to explain something shorter or will split it into multiple entries.

22 September 2022

Jonathan Dowland: Nine Inch Nails, Cornwall, June

In June I travelled to see Nine Inch Nails perform two nights at the Eden Project in Cornwall. It'd been eight years since I last saw them live and when they announced the Eden shows, I thought it might be the only chance I'd get to see them for a long time. I committed, and sods law, a week or so later they announced a handful of single-night UK club shows. On the other hand, on previous tours where they'd typically book two club nights in each city, I've attended one night and always felt I should have done both, so this time I was making that happen. Newquay
approach by air approach by air
Towan Beach (I think) Towan Beach (I think)
For personal reasons it's been a difficult year so it was nice to treat myself to a mini holiday. I stayed in Newquay, a seaside town with many similarities to the North East coast, as well as many differences. It's much bigger, and although we have a thriving surfing community in Tynemouth, Newquay have it on another level. They also have a lot more tourism, which is a double-edged sword: in Newquay, besides surfing, there was not a lot to do. There's a lot of tourist tat shops, and bars and cafes (som very nice ones), but no book shops, no record shops, very few of the quaint, unique boutique places we enjoy up here and possibly take for granted. If you want tie-dyed t-shirts though, you're sorted. Nine Inch Nails have a long-established, independently fan-run forum called Echoing The Sound. There is now also an official Discord server. I asked on both whether anyone was around in Newquay and wanted to meet up: not many people were! But I did meet a new friend, James, for a quiet drink. He was due to share a taxi with Sarah, who was flying in but her flight was delayed and she had to figure out another route. Eden Project
the Eden Project the Eden Project
The Eden Project, the venue itself, is a fascinating place. I didn't realise until I'd planned most of my time there that the gig tickets granted you free entry into the Project on the day of the gig as well as the day after. It was quite tricky to get from Newquay to the Eden project, I would have been better off staying in St Austell itself perhaps, so I didn't take advantage of this, but I did have a couple of hours total to explore a little bit at the venue before the gig on each night. Friday 17th (sunny) Once I got to the venue I managed to meet up with several names from ETS and the Discord: James, Sarah (who managed to re-arrange flights), Pete and his wife (sorry I missed your name), Via Tenebrosa (she of crab hat fame), Dave (DaveDiablo), Elliot and his sister and finally James (sheapdean), someone who I've been talking to online for over a decade and finally met in person (and who taped both shows). I also tried to meet up with a friend from the Debian UK community (hi Lief) but I couldn't find him! Support for Friday was Nitzer Ebb, who I wasn't familiar with before. There were two men on stage, one operating instruments, the other singing. It was a tough time to warm up the crowd, the venue was still very empty and it was very bright and sunny, but I enjoyed what I was hearing. They're definitely on my list. I later learned that the band's regular singer (Doug McCarthy) was unable to make it, and so the guy I was watching (Bon Harris) was standing in for full vocal duties. This made the performance (and their subsequent one at Hellfest the week after) all the more impressive.
pic of the band
Via (with crab hat), Sarah, me (behind). pic by kraw Via (with crab hat), Sarah, me (behind). pic by kraw
(Day) and night one, Thursday, was very hot and sunny and the band seemed a little uncomfortable exposed on stage with little cover. Trent commented as such at least once. The setlist was eclectic: and I finally heard some of my white whale songs. Highlights for me were The Perfect Drug, which was unplayed from 1997-2018 and has now become a staple, and the second ever performance of Everything, the first being a few days earlier. Also notable was three cuts in a row from the last LP, Bad Witch, Heresy and Love Is Not Enough. Saturday 18th (rain)
with Elliot, before with Elliot, before
Day/night 2, Friday, was rainy all day. Support was Yves Tumor, who were an interesting clash of styles: a Prince/Bowie-esque inspired lead clashing with a rock-out lead guitarist styling himself similarly to Brian May. I managed to find Sarah, Elliot (new gig best-buddy), Via and James (sheapdean) again. Pete was at this gig too, but opted to take a more relaxed position than the rail this time. I also spent a lot of time talking to a Canadian guy on a press pass (both nights) that I'm ashamed to have forgotten his name. The dank weather had Nine Inch Nails in their element. I think night one had the more interesting setlist, but night two had the best performance, hands down. Highlights for me were mostly a string of heavier songs (in rough order of scarcity, from common to rarely played): wish, burn, letting you, reptile, every day is exactly the same, the line begins to blur, and finally, happiness in slavery, the first UK performance since 1994. This was a crushing set. A girl in front of me was really suffering with the cold and rain after waiting at the venue all day to get a position on the rail. I thought she was going to pass out. A roadie with NIN noticed, and came over and gave her his jacket. He said if she waited to the end of the show and returned his jacket he'd give her a setlist, and true to his word, he did. This was a really nice thing to happen and really gave the impression that the folks who work on these shows are caring people.
Yep I was this close Yep I was this close
A fuckin' rainbow! Photo by "Lazereth of Nazereth"
Afterwards Afterwards
Night two did have some gentler songs and moments to remember: a re-arranged Sanctified (which ended a nineteen-year hiatus in 2013) And All That Could Have Been (recorded 2002, first played 2018), La Mer, during which the rain broke and we were presented with a beautiful pink-hued rainbow. They then segued into Less Than, providing the comic moment of the night when Trent noticed the rainbow mid-song; now a meme that will go down in NIN fan history. Wrap-up This was a blow-out, once in a lifetime trip to go and see a band who are at the top of their career in terms of performance. One problem I've had with NIN gigs in the past is suffering gig flashback to them when I go to other (inferior) gigs afterwards, and I'm pretty sure I will have this problem again. Doing both nights was worth it, the two experiences were very different and each had its own unique moments. The venue was incredible, and Cornwall is (modulo tourist trap stuff) beautiful.

28 August 2022

Dirk Eddelbuettel: littler 0.3.16 on CRAN: Package Updates

max-heap image The seventeenth release of littler as a CRAN package just landed, following in the now sixteen year history (!!) as a package started by Jeff in 2006, and joined by me a few weeks later. littler is the first command-line interface for R as it predates Rscript. It allows for piping as well for shebang scripting via #!, uses command-line arguments more consistently and still starts faster. It also always loaded the methods package which Rscript only started to do in recent years. littler lives on Linux and Unix, has its difficulties on macOS due to yet-another-braindeadedness there (who ever thought case-insensitive filesystems as a default were a good idea?) and simply does not exist on Windows (yet the build system could be extended see RInside for an existence proof, and volunteers are welcome!). See the FAQ vignette on how to add it to your PATH. A few examples are highlighted at the Github repo, as well as in the examples vignette. This release, the first since last December, further extends install2.r accept multiple repos options thanks to Tatsuya Shima, overhauls and substantially extends installBioc.r thanks to Pieter Moris, and includes a number of (generally smaller) changes I added (see below). The full change description follows.

Changes in littler version 0.3.16 (2022-08-28)
  • Changes in package
    • The configure code checks for two more headers
    • The RNG seeding matches the current version in R (Dirk)
  • Changes in examples
    • A cowu.r 'check Window UCRT' helper was added (Dirk)
    • A getPandoc.r downloader has been added (Dirk)
    • The -r option tp install2.r has been generalzed (Tatsuya Shima in #95)
    • The rcc.r code / package checker now has valgrind option (Dirk)
    • install2.r now installs to first element in .libPaths() by default (Dirk)
    • A very simple r2u.r help has been added (Dirk)
    • The installBioc.r has been generalized and extended similar to install2.r (Pieter Moris in #103)

My CRANberries service provides a comparison to the previous release. Full details for the littler release are provided as usual at the ChangeLog page, and also on the package docs website. The code is available via the GitHub repo, from tarballs and now of course also from its CRAN page and via install.packages("littler"). Binary packages are available directly in Debian as well as soon via Ubuntu binaries at CRAN thanks to the tireless Michael Rutter. Comments and suggestions are welcome at the GitHub repo. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

26 August 2022

Antoine Beaupr : How to nationalize the internet in Canada

Rogers had a catastrophic failure in July 2022. It affected emergency services (as in: people couldn't call 911, but also some 911 services themselves failed), hospitals (which couldn't access prescriptions), banks and payment systems (as payment terminals stopped working), and regular users as well. The outage lasted almost a full day, and Rogers took days to give any technical explanation on the outage, and even when they did, details were sparse. So far the only detailed account is from outside actors like Cloudflare which seem to point at an internal BGP failure. Its impact on the economy has yet to be measured, but it probably cost millions of dollars in wasted time and possibly lead to life-threatening situations. Apart from holding Rogers (criminally?) responsible for this, what should be done in the future to avoid such problems? It's not the first time something like this has happened: it happened to Bell Canada as well. The Rogers outage is also strangely similar to the Facebook outage last year, but, to its credit, Facebook did post a fairly detailed explanation only a day later. The internet is designed to be decentralised, and having large companies like Rogers hold so much power is a crucial mistake that should be reverted. The question is how. Some critics were quick to point out that we need more ISP diversity and competition, but I think that's missing the point. Others have suggested that the internet should be a public good or even straight out nationalized. I believe the solution to the problem of large, private, centralised telcos and ISPs is to replace them with smaller, public, decentralised service providers. The only way to ensure that works is to make sure that public money ends up creating infrastructure controlled by the public, which means treating ISPs as a public utility. This has been implemented elsewhere: it works, it's cheaper, and provides better service.

A modest proposal Global wireless services (like phone services) and home internet inevitably grow into monopolies. They are public utilities, just like water, power, railways, and roads. The question of how they should be managed is therefore inherently political, yet people don't seem to question the idea that only the market (i.e. "competition") can solve this problem. I disagree. 10 years ago (in french), I suggested we, in Qu bec, should nationalize large telcos and internet service providers. I no longer believe is a realistic approach: most of those companies have crap copper-based networks (at least for the last mile), yet are worth billions of dollars. It would be prohibitive, and a waste, to buy them out. Back then, I called this idea "R seau-Qu bec", a reference to the already nationalized power company, Hydro-Qu bec. (This idea, incidentally, made it into the plan of a political party.) Now, I think we should instead build our own, public internet. Start setting up municipal internet services, fiber to the home in all cities, progressively. Then interconnect cities with fiber, and build peering agreements with other providers. This also includes a bid on wireless spectrum to start competing with phone providers as well. And while that sounds really ambitious, I think it's possible to take this one step at a time.

Municipal broadband In many parts of the world, municipal broadband is an elegant solution to the problem, with solutions ranging from Stockholm's city-owned fiber network (dark fiber, layer 1) to Utah's UTOPIA network (fiber to the premises, layer 2) and municipal wireless networks like Guifi.net which connects about 40,000 nodes in Catalonia. A good first step would be for cities to start providing broadband services to its residents, directly. Cities normally own sewage and water systems that interconnect most residences and therefore have direct physical access everywhere. In Montr al, in particular, there is an ongoing project to replace a lot of old lead-based plumbing which would give an opportunity to lay down a wired fiber network across the city. This is a wild guess, but I suspect this would be much less expensive than one would think. Some people agree with me and quote this as low as 1000$ per household. There is about 800,000 households in the city of Montr al, so we're talking about a 800 million dollars investment here, to connect every household in Montr al with fiber and incidentally a quarter of the province's population. And this is not an up-front cost: this can be built progressively, with expenses amortized over many years. (We should not, however, connect Montr al first: it's used as an example here because it's a large number of households to connect.) Such a network should be built with a redundant topology. I leave it as an open question whether we should adopt Stockholm's more minimalist approach or provide direct IP connectivity. I would tend to favor the latter, because then you can immediately start to offer the service to households and generate revenues to compensate for the capital expenditures. Given the ridiculous profit margins telcos currently have 8 billion $CAD net income for BCE (2019), 2 billion $CAD for Rogers (2020) I also believe this would actually turn into a profitable revenue stream for the city, the same way Hydro-Qu bec is more and more considered as a revenue stream for the state. (I personally believe that's actually wrong and we should treat those resources as human rights and not money cows, but I digress. The point is: this is not a cost point, it's a revenue.) The other major challenge here is that the city will need competent engineers to drive this project forward. But this is not different from the way other public utilities run: we have electrical engineers at Hydro, sewer and water engineers at the city, this is just another profession. If anything, the computing science sector might be more at fault than the city here in its failure to provide competent and accountable engineers to society... Right now, most of the network in Canada is copper: we are hitting the limits of that technology with DSL, and while cable has some life left to it (DOCSIS 4.0 does 4Gbps), that is nowhere near the capacity of fiber. Take the town of Chattanooga, Tennessee: in 2010, the city-owned ISP EPB finished deploying a fiber network to the entire town and provided gigabit internet to everyone. Now, 12 years later, they are using this same network to provide the mind-boggling speed of 25 gigabit to the home. To give you an idea, Chattanooga is roughly the size and density of Sherbrooke.

Provincial public internet As part of building a municipal network, the question of getting access to "the internet" will immediately come up. Naturally, this will first be solved by using already existing commercial providers to hook up residents to the rest of the global network. But eventually, networks should inter-connect: Montr al should connect with Laval, and then Trois-Rivi res, then Qu bec City. This will require long haul fiber runs, but those links are not actually that expensive, and many of those already exist as a public resource at RISQ and CANARIE, which cross-connects universities and colleges across the province and the country. Those networks might not have the capacity to cover the needs of the entire province right now, but that is a router upgrade away, thanks to the amazing capacity of fiber. There are two crucial mistakes to avoid at this point. First, the network needs to remain decentralised. Long haul links should be IP links with BGP sessions, and each city (or MRC) should have its own independent network, to avoid Rogers-class catastrophic failures. Second, skill needs to remain in-house: RISQ has already made that mistake, to a certain extent, by selling its neutral datacenter. Tellingly, MetroOptic, probably the largest commercial dark fiber provider in the province, now operates the QIX, the second largest "public" internet exchange in Canada. Still, we have a lot of infrastructure we can leverage here. If RISQ or CANARIE cannot be up to the task, Hydro-Qu bec has power lines running into every house in the province, with high voltage power lines running hundreds of kilometers far north. The logistics of long distance maintenance are already solved by that institution. In fact, Hydro already has fiber all over the province, but it is a private network, separate from the internet for security reasons (and that should probably remain so). But this only shows they already have the expertise to lay down fiber: they would just need to lay down a parallel network to the existing one. In that architecture, Hydro would be a "dark fiber" provider.

International public internet None of the above solves the problem for the entire population of Qu bec, which is notoriously dispersed, with an area three times the size of France, but with only an eight of its population (8 million vs 67). More specifically, Canada was originally a french colony, a land violently stolen from native people who have lived here for thousands of years. Some of those people now live in reservations, sometimes far from urban centers (but definitely not always). So the idea of leveraging the Hydro-Qu bec infrastructure doesn't always work to solve this, because while Hydro will happily flood a traditional hunting territory for an electric dam, they don't bother running power lines to the village they forcibly moved, powering it instead with noisy and polluting diesel generators. So before giving me fiber to the home, we should give power (and potable water, for that matter), to those communities first. So we need to discuss international connectivity. (How else could we consider those communities than peer nations anyways?c) Qu bec has virtually zero international links. Even in Montr al, which likes to style itself a major player in gaming, AI, and technology, most peering goes through either Toronto or New York. That's a problem that we must fix, regardless of the other problems stated here. Looking at the submarine cable map, we see very few international links actually landing in Canada. There is the Greenland connect which connects Newfoundland to Iceland through Greenland. There's the EXA which lands in Ireland, the UK and the US, and Google has the Topaz link on the west coast. That's about it, and none of those land anywhere near any major urban center in Qu bec. We should have a cable running from France up to Saint-F licien. There should be a cable from Vancouver to China. Heck, there should be a fiber cable running all the way from the end of the great lakes through Qu bec, then up around the northern passage and back down to British Columbia. Those cables are expensive, and the idea might sound ludicrous, but Russia is actually planning such a project for 2026. The US has cables running all the way up (and around!) Alaska, neatly bypassing all of Canada in the process. We just look ridiculous on that map. (Addendum: I somehow forgot to talk about Teleglobe here was founded as publicly owned company in 1950, growing international phone and (later) data links all over the world. It was privatized by the conservatives in 1984, along with rails and other "crown corporations". So that's one major risk to any effort to make public utilities work properly: some government might be elected and promptly sell it out to its friends for peanuts.)

Wireless networks I know most people will have rolled their eyes so far back their heads have exploded. But I'm not done yet. I want wireless too. And by wireless, I don't mean a bunch of geeks setting up OpenWRT routers on rooftops. I tried that, and while it was fun and educational, it didn't scale. A public networking utility wouldn't be complete without providing cellular phone service. This involves bidding for frequencies at the federal level, and deploying a rather large amount of infrastructure, but it could be a later phase, when the engineers and politicians have proven their worth. At least part of the Rogers fiasco would have been averted if such a decentralized network backend existed. One might even want to argue that a separate institution should be setup to provide phone services, independently from the regular wired networking, if only for reliability. Because remember here: the problem we're trying to solve is not just technical, it's about political boundaries, centralisation, and automation. If everything is ran by this one organisation again, we will have failed. However, I must admit that phone services is where my ideas fall a little short. I can't help but think it's also an accessible goal maybe starting with a virtual operator but it seems slightly less so than the others, especially considering how closed the phone ecosystem is.

Counter points In debating these ideas while writing this article, the following objections came up.

I don't want the state to control my internet One legitimate concern I have about the idea of the state running the internet is the potential it would have to censor or control the content running over the wires. But I don't think there is necessarily a direct relationship between resource ownership and control of content. Sure, China has strong censorship in place, partly implemented through state-controlled businesses. But Russia also has strong censorship in place, based on regulatory tools: they force private service providers to install back-doors in their networks to control content and surveil their users. Besides, the USA have been doing warrantless wiretapping since at least 2003 (and yes, that's 10 years before the Snowden revelations) so a commercial internet is no assurance that we have a free internet. Quite the contrary in fact: if anything, the commercial internet goes hand in hand with the neo-colonial internet, just like businesses did in the "good old colonial days". Large media companies are the primary censors of content here. In Canada, the media cartel requested the first site-blocking order in 2018. The plaintiffs (including Qu becor, Rogers, and Bell Canada) are both content providers and internet service providers, an obvious conflict of interest. Nevertheless, there are some strong arguments against having a centralised, state-owned monopoly on internet service providers. FDN makes a good point on this. But this is not what I am suggesting: at the provincial level, the network would be purely physical, and regional entities (which could include private companies) would peer over that physical network, ensuring decentralization. Delegating the management of that infrastructure to an independent non-profit or cooperative (but owned by the state) would also ensure some level of independence.

Isn't the government incompetent and corrupt? Also known as "private enterprise is better skilled at handling this, the state can't do anything right" I don't think this is a "fait accomplit". If anything, I have found publicly ran utilities to be spectacularly reliable here. I rarely have trouble with sewage, water, or power, and keep in mind I live in a city where we receive about 2 meters of snow a year, which tend to create lots of trouble with power lines. Unless there's a major weather event, power just runs here. I think the same can happen with an internet service provider. But it would certainly need to have higher standards to what we're used to, because frankly Internet is kind of janky.

A single monopoly will be less reliable I actually agree with that, but that is not what I am proposing anyways. Current commercial or non-profit entities will be free to offer their services on top of the public network. And besides, the current "ha! diversity is great" approach is exactly what we have now, and it's not working. The pretense that we can have competition over a single network is what led the US into the ridiculous situation where they also pretend to have competition over the power utility market. This led to massive forest fires in California and major power outages in Texas. It doesn't work.

Wouldn't this create an isolated network? One theory is that this new network would be so hostile to incumbent telcos and ISPs that they would simply refuse to network with the public utility. And while it is true that the telcos currently do also act as a kind of "tier one" provider in some places, I strongly feel this is also a problem that needs to be solved, regardless of ownership of networking infrastructure. Right now, telcos often hold both ends of the stick: they are the gateway to users, the "last mile", but they also provide peering to the larger internet in some locations. In at least one datacenter in downtown Montr al, I've seen traffic go through Bell Canada that was not directly targeted at Bell customers. So in effect, they are in a position of charging twice for the same traffic, and that's not only ridiculous, it should just be plain illegal. And besides, this is not a big problem: there are other providers out there. As bad as the market is in Qu bec, there is still some diversity in Tier one providers that could allow for some exits to the wider network (e.g. yes, Cogent is here too).

What about Google and Facebook? Nationalization of other service providers like Google and Facebook is out of scope of this discussion. That said, I am not sure the state should get into the business of organising the web or providing content services however, but I will point out it already does do some of that through its own websites. It should probably keep itself to this, and also consider providing normal services for people who don't or can't access the internet. (And I would also be ready to argue that Google and Facebook already act as extensions of the state: certainly if Facebook didn't exist, the CIA or the NSA would like to create it at this point. And Google has lucrative business with the US department of defense.)

What does not work So we've seen one thing that could work. Maybe it's too expensive. Maybe the political will isn't there. Maybe it will fail. We don't know yet. But we know what does not work, and it's what we've been doing ever since the internet has gone commercial.

Subsidies The absurd price we pay for data does not actually mean everyone gets high speed internet at home. Large swathes of the Qu bec countryside don't get broadband at all, and it can be difficult or expensive, even in large urban centers like Montr al, to get high speed internet. That is despite having a series of subsidies that all avoided investing in our own infrastructure. We had the "fonds de l'autoroute de l'information", "information highway fund" (site dead since 2003, archive.org link) and "branchez les familles", "connecting families" (site dead since 2003, archive.org link) which subsidized the development of a copper network. In 2014, more of the same: the federal government poured hundreds of millions of dollars into a program called connecting Canadians to connect 280 000 households to "high speed internet". And now, the federal and provincial governments are proudly announcing that "everyone is now connected to high speed internet", after pouring more than 1.1 billion dollars to connect, guess what, another 380 000 homes, right in time for the provincial election. Of course, technically, the deadline won't actually be met until 2023. Qu bec is a big area to cover, and you can guess what happens next: the telcos threw up their hand and said some areas just can't be connected. (Or they connect their CEO but not the poor folks across the lake.) The story then takes the predictable twist of giving more money out to billionaires, subsidizing now Musk's Starlink system to connect those remote areas. To give a concrete example: a friend who lives about 1000km away from Montr al, 4km from a small, 2500 habitant village, has recently got symmetric 100 mbps fiber at home from Telus, thanks to those subsidies. But I can't get that service in Montr al at all, presumably because Telus and Bell colluded to split that market. Bell doesn't provide me with such a service either: they tell me they have "fiber to my neighborhood", and only offer me a 25/10 mbps ADSL service. (There is Vid otron offering 400mbps, but that's copper cable, again a dead technology, and asymmetric.)

Conclusion Remember Chattanooga? Back in 2010, they funded the development of a fiber network, and now they have deployed a network roughly a thousand times faster than what we have just funded with a billion dollars. In 2010, I was paying Bell Canada 60$/mth for 20mbps and a 125GB cap, and now, I'm still (indirectly) paying Bell for roughly the same speed (25mbps). Back then, Bell was throttling their competitors networks until 2009, when they were forced by the CRTC to stop throttling. Both Bell and Vid otron still explicitly forbid you from running your own servers at home, Vid otron charges prohibitive prices which make it near impossible for resellers to sell uncapped services. Those companies are not spurring innovation: they are blocking it. We have spent all this money for the private sector to build us a private internet, over decades, without any assurance of quality, equity or reliability. And while in some locations, ISPs did deploy fiber to the home, they certainly didn't upgrade their entire network to follow suit, and even less allowed resellers to compete on that network. In 10 years, when 100mbps will be laughable, I bet those service providers will again punt the ball in the public courtyard and tell us they don't have the money to upgrade everyone's equipment. We got screwed. It's time to try something new.

Updates There was a discussion about this article on Hacker News which was surprisingly productive. Trigger warning: Hacker News is kind of right-wing, in case you didn't know. Since this article was written, at least two more major acquisitions happened, just in Qu bec: In the latter case, vMedia was explicitly saying it couldn't grow because of "lack of access to capital". So basically, we have given those companies a billion dollars, and they are not using that very money to buy out their competition. At least we could have given that money to small players to even out the playing field. But this is not how that works at all. Also, in a bizarre twist, an "analyst" believes the acquisition is likely to help Rogers acquire Shaw. Also, since this article was written, the Washington Post published a review of a book bringing similar ideas: Internet for the People The Fight for Our Digital Future, by Ben Tarnoff, at Verso books. It's short, but even more ambitious than what I am suggesting in this article, arguing that all big tech companies should be broken up and better regulated:
He pulls from Ethan Zuckerman s idea of a web that is plural in purpose that just as pool halls, libraries and churches each have different norms, purposes and designs, so too should different places on the internet. To achieve this, Tarnoff wants governments to pass laws that would make the big platforms unprofitable and, in their place, fund small-scale, local experiments in social media design. Instead of having platforms ruled by engagement-maximizing algorithms, Tarnoff imagines public platforms run by local librarians that include content from public media.
(Links mine: the Washington Post obviously prefers to not link to the real web, and instead doesn't link to Zuckerman's site all and suggests Amazon for the book, in a cynical example.) And in another example of how the private sector has failed us, there was recently a fluke in the AMBER alert system where the entire province was warned about a loose shooter in Saint-Elz ar except the people in the town, because they have spotty cell phone coverage. In other words, millions of people received a strongly toned, "life-threatening", alert for a city sometimes hours away, except the people most vulnerable to the alert. Not missing a beat, the CAQ party is promising more of the same medicine again and giving more money to telcos to fix the problem, suggesting to spend three billion dollars in private infrastructure.

15 August 2022

Jonathan Dowland: Temperature monitoring

Xiaomi Mijia Temperature Sensor Xiaomi Mijia Temperature Sensor
I've been having some temperature problems in my house, so I wanted to set up some thermometers which I could read from a computer, and look at trends. I bought a pack of three cheap Xiaomi IoT thermometers. There's some official Xiaomi tooling to access them from smartphones and suchlike, but I wanted something more open. The thermometers have some rudimentary security on them to try and ensure you use the official tooling. This is pretty weak, and the open-source Home Assistant (HA) has support for querying them. I wasn't already running HA and it looked to do more than I needed right now. gathering A friend told me that it was trivial to write custom firmware to the devices. It's so easy you can do it from a web-based flasher: in fact, anyone in range can. There's a family of custom firmwares out there, and most move the sensors readings into the BTLE announce packets, meaning, you can scrape the temperature by simply reading and decoding the announcement packets, no need to handshake at all, and certainly no need to navigate Xiaomi's weird security. This is the one I used. I hacked up a Python script to read the values with the help of this convenience library1. Next, I needed to set up somewhere to write the values. reporting
The study is thankfully cooler today The study is thankfully cooler today
It's been long enough since I last looked at something like this that the best in class software was things like multi router traffic grapher, and rrdtool, or things that build on top of them like Munin. The world seems to have moved on (rightly or wrongly) with a cornucopia of options like Prometheus, Grafana, Graphite/Carbon, InfluxDB, statsd, etc. I ruled most of these out as being too complex for what I want to do, and got something working with Graphite (front-end) and Carbon (back-end). I was surprised that this wasn't packaged in Debian, and opted to try the Docker container. This works, although even that is more complex than I need: it's got graphite and carbon, but also nginx and statsd too; I'm submitting directly to carbon, side-stepping statsd entirely. So as I refine what I'm doing I might possibly strip that back. next steps I might add more sensors in my house! My scripts also need a lot of tidying up. But, I think it would be useful to add some external temperature data, such as something from a Weather service. I am also considering pulling in some of the sensor data from the Newcastle University Urban Observatory, which is something I looked at a while ago for my PhD but didn't ultimately end up using. There are several temperature sensors nearby, but they seem to operate relatively sporadically. There's a load of other interesting sensors in my vicinity, such as air quality monitors. I'm currently ignoring the humidity data from the sensors but I should collect that too. It would be useful to mark relevant "events", too: does switching on or off my desktop PC, or printer, etc. correlate to a jump in temperature?

  1. I might get rid of that in the future as I refine my approach

29 June 2022

Aigars Mahinovs: Long travel in an electric car

Since the first week of April 2022 I have (finally!) changed my company car from a plug-in hybrid to a fully electic car. My new ride, for the next two years, is a BMW i4 M50 in Aventurine Red metallic. An ellegant car with very deep and memorable color, insanely powerful (544 hp/795 Nm), sub-4 second 0-100 km/h, large 84 kWh battery (80 kWh usable), charging up to 210 kW, top speed of 225 km/h and also very efficient (which came out best in this trip) with WLTP range of 510 km and EVDB real range of 435 km. The car also has performance tyres (Hankook Ventus S1 evo3 245/45R18 100Y XL in front and 255/45R18 103Y XL in rear all at recommended 2.5 bar) that have reduced efficiency. So I wanted to document and describe how was it for me to travel ~2000 km (one way) with this, electric, car from south of Germany to north of Latvia. I have done this trip many times before since I live in Germany now and travel back to my relatives in Latvia 1-2 times per year. This was the first time I made this trip in an electric car. And as this trip includes both travelling in Germany (where BEV infrastructure is best in the world) and across Eastern/Northen Europe, I believe that this can be interesting to a few people out there. Normally when I travelled this trip with a gasoline/diesel car I would normally drive for two days with an intermediate stop somewhere around Warsaw with about 12 hours of travel time in each day. This would normally include a couple bathroom stops in each day, at least one longer lunch stop and 3-4 refueling stops on top of that. Normally this would use at least 6 liters of fuel per 100 km on average with total usage of about 270 liters for the whole trip (or about 540 just in fuel costs, nowadays). My (personal) quirk is that both fuel and recharging of my (business) car inside Germany is actually paid by my employer, so it is useful for me to charge up (or fill up) at the last station in Gemany before driving on. The plan for this trip was made in a similar way as when travelling with a gasoline car: travelling as fast as possible on German Autobahn network to last chargin stop on the A4 near G rlitz, there charging up as much as reasonable and then travelling to a hotel in Warsaw, charging there overnight and travelling north towards Ionity chargers in Lithuania from where reaching the final target in north of Latvia should be possible. How did this plan meet the reality? Travelling inside Germany with an electric car was basically perfect. The most efficient way would involve driving fast and hard with top speed of even 180 km/h (where possible due to speed limits and traffic). BMW i4 is very efficient at high speeds with consumption maxing out at 28 kWh/100km when you actually drive at this speed all the time. In real situation in this trip we saw consumption of 20.8-22.2 kWh/100km in the first legs of the trip. The more traffic there is, the more speed limits and roadworks, the lower is the average speed and also the lower the consumption. With this kind of consumption we could comfortably drive 2 hours as fast as we could and then pick any fast charger along the route and in 26 minutes at a charger (50 kWh charged total) we'd be ready to drive for another 2 hours. This lines up very well with recommended rest stops for biological reasons (bathroom, water or coffee, a bit of movement to get blood circulating) and very close to what I had to do anyway with a gasoline car. With a gasoline car I had to refuel first, then park, then go to bathroom and so on. With an electric car I can do all of that while the car is charging and in the end the total time for a stop is very similar. Also not that there was a crazy heat wave going on and temperature outside was at about 34C minimum the whole day and hitting 40C at one point of the trip, so a lot of power was used for cooling. The car has a heat pump standard, but it still was working hard to keep us cool in the sun. The car was able to plan a charging route with all the charging stops required and had all the good options (like multiple intermediate stops) that many other cars (hi Tesla) and mobile apps (hi Google and Apple) do not have yet. There are a couple bugs with charging route and display of current route guidance, those are already fixed and will be delivered with over the air update with July 2022 update. Another good alterantive is the ABRP (A Better Route Planner) that was specifically designed for electric car routing along the best route for charging. Most phone apps (like Google Maps) have no idea about your specific electric car - it has no idea about the battery capacity, charging curve and is missing key live data as well - what is the current consumption and remaining energy in the battery. ABRP is different - it has data and profiles for almost all electric cars and can also be linked to live vehicle data, either via a OBD dongle or via a new Tronity cloud service. Tronity reads data from vehicle-specific cloud service, such as MyBMW service, saves it, tracks history and also re-transmits it to ABRP for live navigation planning. ABRP allows for options and settings that no car or app offers, for example, saying that you want to stop at a particular place for an hour or until battery is charged to 90%, or saying that you have specific charging cards and would only want to stop at chargers that support those. Both the car and the ABRP also support alternate routes even with multiple intermediate stops. In comparison, route planning by Google Maps or Apple Maps or Waze or even Tesla does not really come close. After charging up in the last German fast charger, a more interesting part of the trip started. In Poland the density of high performance chargers (HPC) is much lower than in Germany. There are many chargers (west of Warsaw), but vast majority of them are (relatively) slow 50kW chargers. And that is a difference between putting 50kWh into the car in 23-26 minutes or in 60 minutes. It does not seem too much, but the key bit here is that for 20 minutes there is easy to find stuff that should be done anyway, but after that you are done and you are just waiting for the car and if that takes 4 more minutes or 40 more minutes is a big, perceptual, difference. So using HPC is much, much preferable. So we put in the Ionity charger near Lodz as our intermediate target and the car suggested an intermediate stop at a Greenway charger by Katy Wroclawskie. The location is a bit weird - it has 4 charging stations with 150 kW each. The weird bits are that each station has two CCS connectors, but only one parking place (and the connectors share power, so if two cars were to connect, each would get half power). Also from the front of the location one can only see two stations, the otehr two are semi-hidden around a corner. We actually missed them on the way to Latvia and one person actually waited for the charger behind us for about 10 minutes. We only discovered the other two stations on the way back. With slower speeds in Poland the consumption goes down to 18 kWh/100km which translates to now up to 3 hours driving between stops. At the end of the first day we drove istarting from Ulm from 9:30 in the morning until about 23:00 in the evening with total distance of about 1100 km, 5 charging stops, starting with 92% battery, charging for 26 min (50 kWh), 33 min (57 kWh + lunch), 17 min (23 kWh), 12 min (17 kWh) and 13 min (37 kW). In the last two chargers you can see the difference between a good and fast 150 kW charger at high battery charge level and a really fast Ionity charger at low battery charge level, which makes charging faster still. Arriving to hotel with 23% of battery. Overnight the car charged from a Porsche Destination Charger to 87% (57 kWh). That was a bit less than I would expect from a full power 11kW charger, but good enough. Hotels should really install 11kW Type2 chargers for their guests, it is a really significant bonus that drives more clients to you. The road between Warsaw and Kaunas is the most difficult part of the trip for both driving itself and also for charging. For driving the problem is that there will be a new highway going from Warsaw to Lithuanian border, but it is actually not fully ready yet. So parts of the way one drives on the new, great and wide highway and parts of the way one drives on temporary roads or on old single lane undivided roads. And the most annoying part is navigating between parts as signs are not always clear and the maps are either too old or too new. Some maps do not have the new roads and others have on the roads that have not been actually build or opened to traffic yet. It's really easy to loose ones way and take a significant detour. As far as charging goes, basically there is only the slow 50 kW chargers between Warsaw and Kaunas (for now). We chose to charge on the last charger in Poland, by Suwalki Kaufland. That was not a good idea - there is only one 50 kW CCS and many people decide the same, so there can be a wait. We had to wait 17 minutes before we could charge for 30 more minutes just to get 18 kWh into the battery. Not the best use of time. On the way back we chose a different charger in Lomza where would have a relaxed dinner while the car was charging. That was far more relaxing and a better use of time. We also tried charging at an Orlen charger that was not recommended by our car and we found out why. Unlike all other chargers during our entire trip, this charger did not accept our universal BMW Charging RFID card. Instead it demanded that we download their own Orlen app and register there. The app is only available in some countries (and not in others) and on iPhone it is only available in Polish. That is a bad exception to the rule and a bad example. This is also how most charging works in USA. Here in Europe that is not normal. The normal is to use a charging card - either provided from the car maker or from another supplier (like PlugSufring or Maingau Energy). The providers then make roaming arrangements with all the charging networks, so the cards just work everywhere. In the end the user gets the prices and the bills from their card provider as a single monthly bill. This also saves all any credit card charges for the user. Having a clear, separate RFID card also means that one can easily choose how to pay for each charging session. For example, I have a corporate RFID card that my company pays for (for charging in Germany) and a private BMW Charging card that I am paying myself for (for charging abroad). Having the car itself authenticate direct with the charger (like Tesla does) removes the option to choose how to pay. Having each charge network have to use their own app or token bring too much chaos and takes too much setup. The optimum is having one card that works everywhere and having the option to have additional card or cards for specific purposes. Reaching Ionity chargers in Lithuania is again a breath of fresh air - 20-24 minutes to charge 50 kWh is as expected. One can charge on the first Ionity just enough to reach the next one and then on the second charger one can charge up enough to either reach the Ionity charger in Adazi or the final target in Latvia. There is a huge number of CSDD (Road Traffic and Safety Directorate) managed chargers all over Latvia, but they are 50 kW chargers. Good enough for local travel, but not great for long distance trips. BMW i4 charges at over 50 kW on a HPC even at over 90% battery state of charge (SoC). This means that it is always faster to charge up in a HPC than in a 50 kW charger, if that is at all possible. We also tested the CSDD chargers - they worked without any issues. One could pay with the BMW Charging RFID card, one could use the CSDD e-mobi app or token and one could also use Mobilly - an app that you can use in Latvia for everything from parking to public transport tickets or museums or car washes. We managed to reach our final destination near Aluksne with 17% range remaining after just 3 charging stops: 17+30 min (18 kWh), 24 min (48 kWh), 28 min (36 kWh). Last stop we charged to 90% which took a few extra minutes that would have been optimal. For travel around in Latvia we were charging at our target farmhouse from a normal 3 kW Schuko EU socket. That is very slow. We charged for 33 hours and went from 17% to 94%, so not really full. That was perfectly fine for our purposes. We easily reached Riga, drove to the sea and then back to Aluksne with 8% still in reserve and started charging again for the next trip. If it were required to drive around more and charge faster, we could have used the normal 3-phase 440V connection in the farmhouse to have a red CEE 16A plug installed (same as people use for welders). BMW i4 comes standard with a new BMW Flexible Fast Charger that has changable socket adapters. It comes by default with a Schucko connector in Europe, but for 90 one can buy an adapter for blue CEE plug (3.7 kW) or red CEE 16A or 32A plugs (11 kW). Some public charging stations in France actually use the blue CEE plugs instead of more common Type2 electric car charging stations. The CEE plugs are also common in camping parking places. On the way back the long distance BEV travel was already well understood and did not cause us any problem. From our destination we could easily reach the first Ionity in Lithuania, on the Panevezhis bypass road where in just 8 minutes we got 19 kWh and were ready to drive on to Kaunas, there a longer 32 minute stop before the charging desert of Suwalki Gap that gave us 52 kWh to 90%. That brought us to a shopping mall in Lomzha where we had some food and charged up 39 kWh in lazy 50 minutes. That was enough to bring us to our return hotel for the night - Hotel 500W in Strykow by Lodz that has a 50kW charger on site, while we were having late dinner and preparing for sleep, the car easily recharged to full (71 kWh in 95 minutes), so I just moved it from charger to a parking spot just before going to sleep. Really easy and well flowing day. Second day back went even better as we just needed an 18 minute stop at the same Katy Wroclawskie charger as before to get 22 kWh and that was enough to get back to Germany. After that we were again flying on the Autobahn and charging as needed, 15 min (31 kWh), 23 min (48 kWh) and 31 min (54 kWh + food). We started the day on about 9:40 and were home at 21:40 after driving just over 1000 km on that day. So less than 12 hours for 1000 km travelled, including all charging, bio stops, food and some traffic jams as well. Not bad. Now let's take a look at all the apps and data connections that a technically minded customer can have for their car. Architecturally the car is a network of computers by itself, but it is very secured and normally people do not have any direct access. However, once you log in into the car with your BMW account the car gets your profile info and preferences (seat settings, navigation favorites, ...) and the car then also can start sending information to the BMW backend about its status. This information is then available to the user over multiple different channels. There is no separate channel for each of those data flow. The data only goes once to the backend and then all other communication of apps happens with the backend. First of all the MyBMW app. This is the go-to for everything about the car - seeing its current status and location (when not driving), sending commands to the car (lock, unlock, flash lights, pre-condition, ...) and also monitor and control charging processes. You can also plan a route or destination in the app in advance and then just send it over to the car so it already knows where to drive to when you get to the car. This can also integrate with calendar entries, if you have locations for appointments, for example. This also shows full charging history and allows a very easy export of that data, here I exported all charging sessions from June and then trimmed it back to only sessions relevant to the trip and cut off some design elements to have the data more visible. So one can very easily see when and where we were charging, how much power we got at each spot and (if you set prices for locations) can even show costs. I've already mentioned the Tronity service and its ABRP integration, but it also saves the information that it gets from the car and gathers that data over time. It has nice aspects, like showing the driven routes on a map, having ways to do business trip accounting and having good calendar view. Sadly it does not correctly capture the data for charging sessions (the amounts are incorrect). Update: after talking to Tronity support, it looks like the bug was in the incorrect value for the usable battery capacity for my car. They will look into getting th eright values there by default, but as a workaround one can edit their car in their system (after at least one charging session) and directly set the expected battery capacity (usable) in the car properties on the Tronity web portal settings. One other fun way to see data from your BMW is using the BMW integration in Home Assistant. This brings the car as a device in your own smart home. You can read all the variables from the car current status (and Home Asisstant makes cute historical charts) and you can even see interesting trends, for example for remaining range shows much higher value in Latvia as its prediction is adapted to Latvian road speeds and during the trip it adapts to Polish and then to German road speeds and thus to higher consumption and thus lower maximum predicted remaining range. Having the car attached to the Home Assistant also allows you to attach the car to automations, both as data and event source (like detecting when car enters the "Home" zone) and also as target, so you could flash car lights or even unlock or lock it when certain conditions are met. So, what in the end was the most important thing - cost of the trip? In total we charged up 863 kWh, so that would normally cost one about 290 , which is close to half what this trip would have costed with a gasoline car. Out of that 279 kWh in Germany (paid by my employer) and 154 kWh in the farmhouse (paid by our wonderful relatives :D) so in the end the charging that I actually need to pay adds up to 430 kWh or about 150 . Typically, it took about 400 in fuel that I had to pay to get to Latvia and back. The difference is really nice! In the end I believe that there are three different ways of charging:
  • incidental charging - this is wast majority of charging in the normal day-to-day life. The car gets charged when and where it is convinient to do so along the way. If we go to a movie or a shop and there is a chance to leave the car at a charger, then it can charge up. Works really well, does not take extra time for charging from us.
  • fast charging - charging up at a HPC during optimal charging conditions - from relatively low level to no more than 70-80% while you are still doing all the normal things one would do in a quick stop in a long travel process: bio things, cleaning the windscreen, getting a coffee or a snack.
  • necessary charging - charging from a whatever charger is available just enough to be able to reach the next destination or the next fast charger.
The last category is the only one that is really annoying and should be avoided at all costs. Even by shifting your plans so that you find something else useful to do while necessary charging is happening and thus, at least partially, shifting it over to incidental charging category. Then you are no longer just waiting for the car, you are doing something else and the car magically is charged up again. And when one does that, then travelling with an electric car becomes no more annoying than travelling with a gasoline car. Having more breaks in a trip is a good thing and makes the trips actually easier and less stressfull - I was more relaxed during and after this trip than during previous trips. Having the car air conditioning always be on, even when stopped, was a godsend in the insane heat wave of 30C-38C that we were driving trough. Final stats: 4425 km driven in the trip. Average consumption: 18.7 kWh/100km. Time driving: 2 days and 3 hours. Car regened 152 kWh. Charging stations recharged 863 kWh. Questions? You can use this i4talk forum thread or this Twitter thread to ask them to me.

28 March 2022

Antoine Beaupr : What is going on with web servers

I stumbled upon this graph recently, which is w3techs.com graph of "Historical yearly trends in the usage statistics of web servers". It seems I hadn't looked at it in a long while because I was surprised at many levels:
  1. Apache is now second, behind Nginx, since ~2022 (so that's really new at least)
  2. Cloudflare "server" is third ahead of the traditional third (Microsoft IIS) - I somewhat knew that Cloudflare was hosting a lot of stuff, but I somehow didn't expect to see it there at all for some reason
  3. I had to lookup what LiteSpeed was (and it's not a bike company): it's a drop-in replacement (!?) of the Apache web server (not a fork, a bizarre idea but which seems to be gaining a lot of speed recently, possibly because of its support for QUIC and HTTP/3
So there. I'm surprised. I guess the stats should be taken with a grain of salt because they only partially correlate with Netcraft's numbers which barely mention LiteSpeed at all. (But they do point at a rising share as well.) Netcraft also puts Nginx's first place earlier in time, around April 2019, which is about when Microsoft's IIS took a massive plunge down. That is another thing that doesn't map with w3techs' graphs at all either. So it's all lies and statistics, basically. Moving on. Oh and of course, the two first most popular web servers, regardless of the source, are package in Debian. So while we're working on statistics and just making stuff up, I'm going to go ahead and claim all of this stuff runs on Linux and that BSD is dead. Or something like that.

7 March 2022

Ayoyimika Ajibade: Progress Report!! Modifying Expectations...

Wait! Just like yesterday when I was accepted as an Outreachy intern and the first half of the internship is finished . How time flies when you are having a good time As part of the requirements for the final application during the contribution period for the Outreachy internship, I needed to provide a timeline to achieve our goal on my outreachy task which is transitioning of dependencies in node16 and webpack5. Having consulted my mentors who implied that the packages depending on webpack and nodejs combined are so numerous that its impossible to finish all within a space of three months but we have steps to guide us through the entire process to achieve most of our goals which are
  • Find a list of packages(failing rebuild and testing with autopkgtest, of reverse dependencies of webpack and nodejs) to fix.
  • See if new upstream versions are available that support nodejs16 and webpack5 respectively.
  • See if the new upstream version works and doesn't fail while rebuilding or testing with autopkgtest.
  • Report bug in Debian if any fails to rebuild or test with autopkgtest.
  • Forward bugs upstream if needed.
  • Fix packages and forward patches.
As of this writing(though a little late ) we have successfully rebuilt all reverse dependencies of webpack5 and split them equally each for I and my co-intern for all Javascript modules as ruby packages also depend on webpack which is a total of 44 packages. Filed a bug report on Debian bug tracking system for failing packages, also the original maintainer or uploader of the package to the Debian archive mostly Debian developers also get a mail in references to the package bug report. Sometimes the uploader who also receives the bug report decides to help out to fix the package and forward the patch upstream if need be. We have also filed an issue to upstream repo mostly via github where some respond and create PR to solve those errors and others are plain aversive to the whole idea. PR from the upstream developer is cherry-picked and a patch is created by us to incorporate the code into our own working repository. some package upstream maintainer rejects such issues or doesn't respond, we take it upon ourselves to fix the package. The total number of packages that are successfully updated and ready to be merged is 10 packages while 12 packages remain on my own end to be updated. One of the most challenging packages to update so far was prop-type as its runs its large test suite using jest of a lower version 19.0.2 compared to that of Debian OS which is version 27.5.1 updating and migrating its API's and methods to use the Debian updated version is so challenging after several googling, testing out the solution from StackOverflow, trials, and errors, reading documentations we eventually made progress with the help of my mentor, co-intern and the whole community member. It's so crazy that when I got it working I said to myself. phew it's not rocket science why can I figure it out sooner than expected I initially proposed that I would be halfway done with the project by now, I guess the reason am not able to achieve some of our goals which are finishing up with the packages for webpack and moving to transition some of the nodejs packages at all is DEBUGGING. Yes DEBUGGING! where you never can predict what the solution is. is the problem coming from Debian? or dependencies of the package you are working on, upstream bug, or dependencies of dependencies of the package you are working on, so many questions to answer. You can't easily find a solution to a bug as it takes time to try out so many guesses more of an educated guess, or even try out all the solutions from stack overflow and still no viable progress. Obviously, you cannot really know about something to set up a plan for unless you get right into it. One way of doing this, if I have to start again is the truly understand how the javascript package work under the hood, how its handles different interaction between packages, some of its dos and don't of transpiling, bundling, testing, e.t.c I guess my unrealistic goals need to be modified because some drawback that was not envisaged popped up and I underestimated the complexity of the tasks, which will be reducing the number of packages to update in transitioning of nodejs from what I planned My major focus for the second half of the internship is to fix bugs and errors I discover, file bug reports for future bugs to seek help from co-maintainer or developers, file issues upstream and close those whose bugs are already resolved for the remaining 12 packages, and ultimately successfully uploading all reverse dependencies. Also diving into transitioning of nodejs16. Thanks for stopping by

2 March 2022

Antoine Beaupr : procmail considered harmful

TL;DR: procmail is a security liability and has been abandoned upstream for the last two decades. If you are still using it, you should probably drop everything and at least remove its SUID flag. There are plenty of alternatives to chose from, and conversion is a one-time, acceptable trade-off.

Procmail is unmaintained procmail is unmaintained. The "Final release", according to Wikipedia, dates back to September 10, 2001 (3.22). That release was shipped in Debian since then, all the way back from Debian 3.0 "woody", twenty years ago. Debian also ships 25 uploads on top of this, with 3.22-21 shipping the "3.23pre" release that has been rumored since at least the November 2001, according to debian/changelog at least:
procmail (3.22-1) unstable; urgency=low
  * New upstream release, which uses the  standard' format for Maildir
    filenames and retries on name collision. It also contains some
    bug fixes from the 3.23pre snapshot dated 2001-09-13.
  * Removed  sendmail' from the Recommends field, since we already
    have  exim' (the default Debian MTA) and  mail-transport-agent'.
  * Removed suidmanager support. Conflicts: suidmanager (<< 0.50).
  * Added support for DEB_BUILD_OPTIONS in the source package.
  * README.Maildir: Do not use locking on the example recipe,
    since it's wrong to do so in this case.
 -- Santiago Vila <sanvila@debian.org>  Wed, 21 Nov 2001 09:40:20 +0100
All Debian suites from buster onwards ship the 3.22-26 release, although the maintainer just pushed a 3.22-27 release to fix a seven year old null pointer dereference, after this article was drafted. Procmail is also shipped in all major distributions: Fedora and its derivatives, Debian derivatives, Gentoo, Arch, FreeBSD, OpenBSD. We all seem to be ignoring this problem. The upstream website (http://procmail.org/) has been down since about 2015, according to Debian bug #805864, with no change since. In effect, every distribution is currently maintaining its fork of this dead program. Note that, after filing a bug to keep Debian from shipping procmail in a stable release again, I was told that the Debian maintainer is apparently in contact with the upstream. And, surprise! they still plan to release that fabled 3.23 release, which has been now in "pre-release" for all those twenty years. In fact, it turns out that 3.23 is considered released already, and that the procmail author actually pushed a 3.24 release, codenamed "Two decades of fixes". That amounts to 25 commits since 3.23pre some of which address serious security issues, but none of which address fundamental issues with the code base.

Procmail is insecure By default, procmail is installed SUID root:mail in Debian. There's no debconf or pre-seed setting that can change this. There has been two bug reports against the Debian to make this configurable (298058, 264011), but both were closed to say that, basically, you should use dpkg-statoverride to change the permissions on the binary. So if anything, you should immediately run this command on any host that you have procmail installed on:
dpkg-statoverride --update --add root root 0755 /usr/bin/procmail
Note that this might break email delivery. It might also not work at all, thanks to usrmerge. Not sure. Yes, everything is on fire. This is fine. In my opinion, even assuming we keep procmail in Debian, that default should be reversed. It should be up to people installing procmail to assign it those dangerous permissions, after careful consideration of the risk involved. The last maintainer of procmail explicitly advised us (in that null pointer dereference bug) and other projects (e.g. OpenBSD, in [2]) to stop shipping it, back in 2014. Quote:
Executive summary: delete the procmail port; the code is not safe and should not be used as a basis for any further work.
I just read some of the code again this morning, after the original author claimed that procmail was active again. It's still littered with bizarre macros like:
#define bit_set(name,which,value) \
  (value?(name[bit_index(which)] =bit_mask(which)):\
  (name[bit_index(which)]&=~bit_mask(which)))
... from regexp.c, line 66 (yes, that's a custom regex engine). Or this one:
#define jj  (aleps.au.sopc)
It uses insecure functions like strcpy extensively. malloc() is thrown around gotos like it's 1984 all over again. (To be fair, it has been feeling like 1984 a lot lately, but that's another matter entirely.) That null pointer deref bug? It's fixed upstream now, in this commit merged a few hours ago, which I presume might be in response to my request to remove procmail from Debian. So while that's nice, this is the just tip of the iceberg. I speculate that one could easily find an exploitable crash in procmail if only by running it through a fuzzer. But I don't need to speculate: procmail had, for years, serious security issues that could possibly lead to root privilege escalation, remotely exploitable if procmail is (as it's designed to do) exposed to the network. Maybe I'm overreacting. Maybe the procmail author will go through the code base and do a proper rewrite. But I don't think that's what is in the cards right now. What I expect will happen next is that people will start fuzzing procmail, throw an uncountable number of bug reports at it which will get fixed in a trickle while never fixing the underlying, serious design flaws behind procmail.

Procmail has better alternatives The reason this is so frustrating is that there are plenty of modern alternatives to procmail which do not suffer from those problems. Alternatives to procmail(1) itself are typically part of mail servers. For example, Dovecot has its own LDA which implements the standard Sieve language (RFC 5228). (Interestingly, Sieve was published as RFC 3028 in 2001, before procmail was formally abandoned.) Courier also has "maildrop" which has its own filtering mechanism, and there is fdm (2007) which is a fetchmail and procmail replacement. Update: there's also mailprocessing, which is not an LDA, but processing an existing folder. It was, however, specifically designed to replace complex Procmail rules. But procmail, of course, doesn't just ship procmail; that would just be too easy. It ships mailstat(1) which we could probably ignore because it only parses procmail log files. But more importantly, it also ships:
  • lockfile(1) - conditional semaphore-file creator
  • formail(1) - mail (re)formatter
lockfile(1) already has a somewhat acceptable replacement in the form of flock(1), part of util-linux (which is Essential, so installed on any normal Debian system). It might not be a direct drop-in replacement, but it should be close enough. formail(1) is similar: the courier maildrop package ships reformail(1) which is, presumably, a rewrite of formail. It's unclear if it's a drop-in replacement, but it should probably possible to port uses of formail to it easily.
Update: the maildrop package ships a SUID root binary (two, even). So if you want only reformail(1), you might want to disable that with:
dpkg-statoverride --update --add root root 0755 /usr/bin/lockmail.maildrop 
dpkg-statoverride --update --add root root 0755 /usr/bin/maildrop
It would be perhaps better to have reformail(1) as a separate package, see bug 1006903 for that discussion.
The real challenge is, of course, migrating those old .procmailrc recipes to Sieve (basically). I added a few examples in the appendix below. You might notice the Sieve examples are easier to read, which is a nice added bonus.

Conclusion There is really, absolutely, no reason to keep procmail in Debian, nor should it be used anywhere at this point. It's a great part of our computing history. May it be kept forever in our museums and historical archives, but not in Debian, and certainly not in actual release. It's just a bomb waiting to go off. It is irresponsible for distributions to keep shipping obsolete and insecure software like this for unsuspecting users. Note that I am grateful to the author, I really am: I used procmail for decades and it served me well. But now, it's time to move, not bring it back from the dead.

Appendix

Previous work It's really weird to have to write this blog post. Back in 2016, I rebuilt my mail setup at home and, to my horror, discovered that procmail had been abandoned for 15 years at that point, thanks to that LWN article from 2010. I would have thought that I was the only weirdo still running procmail after all those years and felt kind of embarrassed to only "now" switch to the more modern (and, honestly, awesome) Sieve language. But no. Since then, Debian shipped three major releases (stretch, buster, and bullseye), all with the same vulnerable procmail release. Then, in early 2022, I found that, at work, we actually had procmail installed everywhere, possibly because userdir-ldap was using it for lockfile until 2019. I sent a patch to fix that and scrambled to remove get rid of procmail everywhere. That took about a day. But many other sites are now in that situation, possibly not imagining they have this glaring security hole in their infrastructure.

Procmail to Sieve recipes I'll collect a few Sieve equivalents to procmail recipes here. If you have any additions, do contact me. All Sieve examples below assume you drop the file in ~/.dovecot.sieve.

deliver mail to "plus" extension folder Say you want to deliver user+foo@example.com to the folder foo. You might write something like this in procmail:
MAILDIR=$HOME/Maildir/
DEFAULT=$MAILDIR
LOGFILE=$HOME/.procmail.log
VERBOSE=off
EXTENSION=$1            # Need to rename it - ?? does not like $1 nor 1
:0
* EXTENSION ?? [a-zA-Z0-9]+
        .$EXTENSION/
That, in sieve language, would be:
require ["variables", "envelope", "fileinto", "subaddress"];
########################################################################
# wildcard +extension
# https://doc.dovecot.org/configuration_manual/sieve/examples/#plus-addressed-mail-filtering
if envelope :matches :detail "to" "*"  
  # Save name in $ name  in all lowercase
  set :lower "name" "$ 1 ";
  fileinto "$ name ";
  stop;
 

Subject into folder This would file all mails with a Subject: line having FreshPorts in it into the freshports folder, and mails from alternc.org mailing lists into the alternc folder:
:0
## mailing list freshports
* ^Subject.*FreshPorts.*
.freshports/
:0
## mailing list alternc
* ^List-Post.*mailto:.*@alternc.org.*
.alternc/
Equivalent Sieve:
if header :contains "subject" "FreshPorts"  
    fileinto "freshports";
  elsif header :contains "List-Id" "alternc.org"  
    fileinto "alternc";
 

Mail sent to root to a reports folder This double rule:
:0
* ^Subject: Cron
* ^From: .*root@
.rapports/
Would look something like this in Sieve:
if header :comparator "i;octet" :contains "Subject" "Cron"  
  if header :regex :comparator "i;octet"  "From" ".*root@"  
        fileinto "rapports";
   
 
Note that this is what the automated converted does (below). It's not very readable, but it works.

Bulk email I didn't have an equivalent of this in procmail, but that's something I did in Sieve:
if header :contains "Precedence" "bulk"  
    fileinto "bulk";
 

Any mailing list This is another rule I didn't have in procmail but I found handy and easy to do in Sieve:
if exists "List-Id"  
    fileinto "lists";
 

This or that I wouldn't remember how to do this in procmail either, but that's an easy one in Sieve:
if anyof (header :contains "from" "example.com",
           header :contains ["to", "cc"] "anarcat@example.com")  
    fileinto "example";
 
You can even pile up a bunch of options together to have one big rule with multiple patterns:
if anyof (exists "X-Cron-Env",
          header :contains ["subject"] ["security run output",
                                        "monthly run output",
                                        "daily run output",
                                        "weekly run output",
                                        "Debian Package Updates",
                                        "Debian package update",
                                        "daily mail stats",
                                        "Anacron job",
                                        "nagios",
                                        "changes report",
                                        "run output",
                                        "[Systraq]",
                                        "Undelivered mail",
                                        "Postfix SMTP server: errors from",
                                        "backupninja",
                                        "DenyHosts report",
                                        "Debian security status",
                                        "apt-listchanges"
                                        ],
           header :contains "Auto-Submitted" "auto-generated",
           envelope :contains "from" ["nagios@",
                                      "logcheck@",
                                      "root@"])
     
    fileinto "rapports";
 

Automated script There is a procmail2sieve.pl script floating around, and mentioned in the dovecot documentation. It didn't work very well for me: I could use it for small things, but I mostly wrote the sieve file from scratch.

Progressive migration Enrico Zini has progressively migrated his procmail setup to Sieve using a clever way: he hooked procmail inside sieve so that he could deliver to the Dovecot LDA and progressively migrate rules one by one, without having a "flag day". See this explanatory blog post for the details, which also shows how to configure Dovecot as an LMTP server with Postfix.

Other examples The Dovecot sieve examples are numerous and also quite useful. At the time of writing, they include virus scanning and spam filtering, vacation auto-replies, includes, archival, and flags.

Harmful considered harmful I am aware that the "considered harmful" title has a long and controversial history, being considered harmful in itself (by some people who are obviously not afraid of contradictions). I have nevertheless deliberately chosen that title, partly to make sure this article gets maximum visibility, but more specifically because I do not have doubts at this moment that procmail is, clearly, a bad idea at this moment in history.

Developing story I must also add that, incredibly, this story has changed while writing it. This article is derived from this bug I filed in Debian to, quite frankly, kick procmail out of Debian. But filing the bug had the interesting effect of pushing the upstream into action: as mentioned above, they have apparently made a new release and merged a bunch of patches in a new git repository. This doesn't change much of the above, at this moment. If anything significant comes out of this effort, I will try to update this article to reflect the situation. I am actually happy to retract the claims in this article if it turns out that procmail is a stellar example of defensive programming and survives fuzzing attacks. But at this moment, I'm pretty confident that will not happen, at least not in scope of the next Debian release cycle.

23 January 2022

Antoine Beaupr : Switching from OpenNTPd to Chrony

A friend recently reminded me of the existence of chrony, a "versatile implementation of the Network Time Protocol (NTP)". The excellent introduction is worth quoting in full:
It can synchronise the system clock with NTP servers, reference clocks (e.g. GPS receiver), and manual input using wristwatch and keyboard. It can also operate as an NTPv4 (RFC 5905) server and peer to provide a time service to other computers in the network. It is designed to perform well in a wide range of conditions, including intermittent network connections, heavily congested networks, changing temperatures (ordinary computer clocks are sensitive to temperature), and systems that do not run continuosly, or run on a virtual machine. Typical accuracy between two machines synchronised over the Internet is within a few milliseconds; on a LAN, accuracy is typically in tens of microseconds. With hardware timestamping, or a hardware reference clock, sub-microsecond accuracy may be possible.
Now that's already great documentation right there. What it is, why it's good, and what to expect from it. I want more. They have a very handy comparison table between chrony, ntp and openntpd.

My problem with OpenNTPd Following concerns surrounding the security (and complexity) of the venerable ntp program, I have, a long time ago, switched to using openntpd on all my computers. I hadn't thought about it until I recently noticed a lot of noise on one of my servers:
jan 18 10:09:49 curie ntpd[1069]: adjusting local clock by -1.604366s
jan 18 10:08:18 curie ntpd[1069]: adjusting local clock by -1.577608s
jan 18 10:05:02 curie ntpd[1069]: adjusting local clock by -1.574683s
jan 18 10:04:00 curie ntpd[1069]: adjusting local clock by -1.573240s
jan 18 10:02:26 curie ntpd[1069]: adjusting local clock by -1.569592s
You read that right, openntpd was constantly rewinding the clock, sometimes in less than two minutes. The above log was taken while doing diagnostics, looking at the last 30 minutes of logs. So, on average, one 1.5 seconds rewind per 6 minutes! That might be due to a dying real time clock (RTC) or some other hardware problem. I know for a fact that the CMOS battery on that computer (curie) died and I wasn't able to replace it (!). So that's partly garbage-in, garbage-out here. But still, I was curious to see how chrony would behave... (Spoiler: much better.) But I also had trouble on another workstation, that one a much more recent machine (angela). First, it seems OpenNTPd would just fail at boot time:
anarcat@angela:~(main)$ sudo systemctl status openntpd
  openntpd.service - OpenNTPd Network Time Protocol
     Loaded: loaded (/lib/systemd/system/openntpd.service; enabled; vendor pres>
     Active: inactive (dead) since Sun 2022-01-23 09:54:03 EST; 6h ago
       Docs: man:openntpd(8)
    Process: 3291 ExecStartPre=/usr/sbin/ntpd -n $DAEMON_OPTS (code=exited, sta>
    Process: 3294 ExecStart=/usr/sbin/ntpd $DAEMON_OPTS (code=exited, status=0/>
   Main PID: 3298 (code=exited, status=0/SUCCESS)
        CPU: 34ms
jan 23 09:54:03 angela systemd[1]: Starting OpenNTPd Network Time Protocol...
jan 23 09:54:03 angela ntpd[3291]: configuration OK
jan 23 09:54:03 angela ntpd[3297]: ntp engine ready
jan 23 09:54:03 angela ntpd[3297]: ntp: recvfrom: Permission denied
jan 23 09:54:03 angela ntpd[3294]: Terminating
jan 23 09:54:03 angela systemd[1]: Started OpenNTPd Network Time Protocol.
jan 23 09:54:03 angela systemd[1]: openntpd.service: Succeeded.
After a restart, somehow it worked, but it took a long time to sync the clock. At first, it would just not consider any peer at all:
anarcat@angela:~(main)$ sudo ntpctl -s all
0/20 peers valid, clock unsynced
peer
   wt tl st  next  poll          offset       delay      jitter
159.203.8.72 from pool 0.debian.pool.ntp.org
    1  5  2    6s    6s             ---- peer not valid ----
138.197.135.239 from pool 0.debian.pool.ntp.org
    1  5  2    6s    7s             ---- peer not valid ----
216.197.156.83 from pool 0.debian.pool.ntp.org
    1  4  1    2s    9s             ---- peer not valid ----
142.114.187.107 from pool 0.debian.pool.ntp.org
    1  5  2    5s    6s             ---- peer not valid ----
216.6.2.70 from pool 1.debian.pool.ntp.org
    1  4  2    2s    8s             ---- peer not valid ----
207.34.49.172 from pool 1.debian.pool.ntp.org
    1  4  2    0s    5s             ---- peer not valid ----
198.27.76.102 from pool 1.debian.pool.ntp.org
    1  5  2    5s    5s             ---- peer not valid ----
158.69.254.196 from pool 1.debian.pool.ntp.org
    1  4  3    1s    6s             ---- peer not valid ----
149.56.121.16 from pool 2.debian.pool.ntp.org
    1  4  2    5s    9s             ---- peer not valid ----
162.159.200.123 from pool 2.debian.pool.ntp.org
    1  4  3    1s    6s             ---- peer not valid ----
206.108.0.131 from pool 2.debian.pool.ntp.org
    1  4  1    6s    9s             ---- peer not valid ----
205.206.70.40 from pool 2.debian.pool.ntp.org
    1  5  2    8s    9s             ---- peer not valid ----
2001:678:8::123 from pool 2.debian.pool.ntp.org
    1  4  2    5s    9s             ---- peer not valid ----
2606:4700:f1::1 from pool 2.debian.pool.ntp.org
    1  4  3    2s    6s             ---- peer not valid ----
2607:5300:205:200::1991 from pool 2.debian.pool.ntp.org
    1  4  2    5s    9s             ---- peer not valid ----
2607:5300:201:3100::345c from pool 2.debian.pool.ntp.org
    1  4  4    1s    6s             ---- peer not valid ----
209.115.181.110 from pool 3.debian.pool.ntp.org
    1  5  2    5s    6s             ---- peer not valid ----
205.206.70.42 from pool 3.debian.pool.ntp.org
    1  4  2    0s    6s             ---- peer not valid ----
68.69.221.61 from pool 3.debian.pool.ntp.org
    1  4  1    2s    9s             ---- peer not valid ----
162.159.200.1 from pool 3.debian.pool.ntp.org
    1  4  3    4s    7s             ---- peer not valid ----
Then it would accept them, but still wouldn't sync the clock:
anarcat@angela:~(main)$ sudo ntpctl -s all
20/20 peers valid, clock unsynced
peer
   wt tl st  next  poll          offset       delay      jitter
159.203.8.72 from pool 0.debian.pool.ntp.org
    1  8  2    5s    6s         0.672ms    13.507ms     0.442ms
138.197.135.239 from pool 0.debian.pool.ntp.org
    1  7  2    4s    8s         1.260ms    13.388ms     0.494ms
216.197.156.83 from pool 0.debian.pool.ntp.org
    1  7  1    3s    5s        -0.390ms    47.641ms     1.537ms
142.114.187.107 from pool 0.debian.pool.ntp.org
    1  7  2    1s    6s        -0.573ms    15.012ms     1.845ms
216.6.2.70 from pool 1.debian.pool.ntp.org
    1  7  2    3s    8s        -0.178ms    21.691ms     1.807ms
207.34.49.172 from pool 1.debian.pool.ntp.org
    1  7  2    4s    8s        -5.742ms    70.040ms     1.656ms
198.27.76.102 from pool 1.debian.pool.ntp.org
    1  7  2    0s    7s         0.170ms    21.035ms     1.914ms
158.69.254.196 from pool 1.debian.pool.ntp.org
    1  7  3    5s    8s        -2.626ms    20.862ms     2.032ms
149.56.121.16 from pool 2.debian.pool.ntp.org
    1  7  2    6s    8s         0.123ms    20.758ms     2.248ms
162.159.200.123 from pool 2.debian.pool.ntp.org
    1  8  3    4s    5s         2.043ms    14.138ms     1.675ms
206.108.0.131 from pool 2.debian.pool.ntp.org
    1  6  1    0s    7s        -0.027ms    14.189ms     2.206ms
205.206.70.40 from pool 2.debian.pool.ntp.org
    1  7  2    1s    5s        -1.777ms    53.459ms     1.865ms
2001:678:8::123 from pool 2.debian.pool.ntp.org
    1  6  2    1s    8s         0.195ms    14.572ms     2.624ms
2606:4700:f1::1 from pool 2.debian.pool.ntp.org
    1  7  3    6s    9s         2.068ms    14.102ms     1.767ms
2607:5300:205:200::1991 from pool 2.debian.pool.ntp.org
    1  6  2    4s    9s         0.254ms    21.471ms     2.120ms
2607:5300:201:3100::345c from pool 2.debian.pool.ntp.org
    1  7  4    5s    9s        -1.706ms    21.030ms     1.849ms
209.115.181.110 from pool 3.debian.pool.ntp.org
    1  7  2    0s    7s         8.907ms    75.070ms     2.095ms
205.206.70.42 from pool 3.debian.pool.ntp.org
    1  7  2    6s    9s        -1.729ms    53.823ms     2.193ms
68.69.221.61 from pool 3.debian.pool.ntp.org
    1  7  1    1s    7s        -1.265ms    46.355ms     4.171ms
162.159.200.1 from pool 3.debian.pool.ntp.org
    1  7  3    4s    8s         1.732ms    35.792ms     2.228ms
It took a solid five minutes to sync the clock, even though the peers were considered valid within a few seconds:
jan 23 15:58:41 angela systemd[1]: Started OpenNTPd Network Time Protocol.
jan 23 15:58:58 angela ntpd[84086]: peer 142.114.187.107 now valid
jan 23 15:58:58 angela ntpd[84086]: peer 198.27.76.102 now valid
jan 23 15:58:58 angela ntpd[84086]: peer 207.34.49.172 now valid
jan 23 15:58:58 angela ntpd[84086]: peer 209.115.181.110 now valid
jan 23 15:58:59 angela ntpd[84086]: peer 159.203.8.72 now valid
jan 23 15:58:59 angela ntpd[84086]: peer 138.197.135.239 now valid
jan 23 15:58:59 angela ntpd[84086]: peer 162.159.200.123 now valid
jan 23 15:58:59 angela ntpd[84086]: peer 2607:5300:201:3100::345c now valid
jan 23 15:59:00 angela ntpd[84086]: peer 2606:4700:f1::1 now valid
jan 23 15:59:00 angela ntpd[84086]: peer 158.69.254.196 now valid
jan 23 15:59:01 angela ntpd[84086]: peer 216.6.2.70 now valid
jan 23 15:59:01 angela ntpd[84086]: peer 68.69.221.61 now valid
jan 23 15:59:01 angela ntpd[84086]: peer 205.206.70.40 now valid
jan 23 15:59:01 angela ntpd[84086]: peer 205.206.70.42 now valid
jan 23 15:59:02 angela ntpd[84086]: peer 162.159.200.1 now valid
jan 23 15:59:04 angela ntpd[84086]: peer 216.197.156.83 now valid
jan 23 15:59:05 angela ntpd[84086]: peer 206.108.0.131 now valid
jan 23 15:59:05 angela ntpd[84086]: peer 2001:678:8::123 now valid
jan 23 15:59:05 angela ntpd[84086]: peer 149.56.121.16 now valid
jan 23 15:59:07 angela ntpd[84086]: peer 2607:5300:205:200::1991 now valid
jan 23 16:03:47 angela ntpd[84086]: clock is now synced
That seems kind of odd. It was also frustrating to have very little information from ntpctl about the state of the daemon. I understand it's designed to be minimal, but it could inform me on his known offset, for example. It does tell me about the offset with the different peers, but not as clearly as one would expect. It's also unclear how it disciplines the RTC at all.

Compared to chrony Now compare with chrony:
jan 23 16:07:16 angela systemd[1]: Starting chrony, an NTP client/server...
jan 23 16:07:16 angela chronyd[87765]: chronyd version 4.0 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 -DEBUG)
jan 23 16:07:16 angela chronyd[87765]: Initial frequency 3.814 ppm
jan 23 16:07:16 angela chronyd[87765]: Using right/UTC timezone to obtain leap second data
jan 23 16:07:16 angela chronyd[87765]: Loaded seccomp filter
jan 23 16:07:16 angela systemd[1]: Started chrony, an NTP client/server.
jan 23 16:07:21 angela chronyd[87765]: Selected source 206.108.0.131 (2.debian.pool.ntp.org)
jan 23 16:07:21 angela chronyd[87765]: System clock TAI offset set to 37 seconds
First, you'll notice there's none of that "clock synced" nonsense, it picks a source, and then... it's just done. Because the clock on this computer is not drifting that much, and openntpd had (presumably) just sync'd it anyways. And indeed, if we look at detailed stats from the powerful chronyc client:
anarcat@angela:~(main)$ sudo chronyc tracking
Reference ID    : CE6C0083 (ntp1.torix.ca)
Stratum         : 2
Ref time (UTC)  : Sun Jan 23 21:07:21 2022
System time     : 0.000000311 seconds slow of NTP time
Last offset     : +0.000807989 seconds
RMS offset      : 0.000807989 seconds
Frequency       : 3.814 ppm fast
Residual freq   : -24.434 ppm
Skew            : 1000000.000 ppm
Root delay      : 0.013200894 seconds
Root dispersion : 65.357254028 seconds
Update interval : 1.4 seconds
Leap status     : Normal
We see that we are nanoseconds away from NTP time. That was ran very quickly after starting the server (literally in the same second as chrony picked a source), so stats are a bit weird (e.g. the Skew is huge). After a minute or two, it looks more reasonable:
Reference ID    : CE6C0083 (ntp1.torix.ca)
Stratum         : 2
Ref time (UTC)  : Sun Jan 23 21:09:32 2022
System time     : 0.000487002 seconds slow of NTP time
Last offset     : -0.000332960 seconds
RMS offset      : 0.000751204 seconds
Frequency       : 3.536 ppm fast
Residual freq   : +0.016 ppm
Skew            : 3.707 ppm
Root delay      : 0.013363549 seconds
Root dispersion : 0.000324015 seconds
Update interval : 65.0 seconds
Leap status     : Normal
Now it's learning how good or bad the RTC clock is ("Frequency"), and is smoothly adjusting the System time to follow the average offset (RMS offset, more or less). You'll also notice the Update interval has risen, and will keep expanding as chrony learns more about the internal clock, so it doesn't need to constantly poll the NTP servers to sync the clock. In the above, we're 487 micro seconds (less than a milisecond!) away from NTP time. (People interested in the explanation of every single one of those fields can read the excellent chronyc manpage. That thing made me want to nerd out on NTP again!) On the machine with the bad clock, chrony also did a 1.5 second adjustment, but just once, at startup:
jan 18 11:54:33 curie chronyd[2148399]: Selected source 206.108.0.133 (2.debian.pool.ntp.org) 
jan 18 11:54:33 curie chronyd[2148399]: System clock wrong by -1.606546 seconds 
jan 18 11:54:31 curie chronyd[2148399]: System clock was stepped by -1.606546 seconds 
jan 18 11:54:31 curie chronyd[2148399]: System clock TAI offset set to 37 seconds 
Then it would still struggle to keep the clock in sync, but not as badly as openntpd. Here's the offset a few minutes after that above startup:
System time     : 0.000375352 seconds slow of NTP time
And again a few seconds later:
System time     : 0.001793046 seconds slow of NTP time
I don't currently have access to that machine, and will update this post with the latest status, but so far I've had a very good experience with chrony on that machine, which is a testament to its resilience, and it also just works on my other machines as well.

Extras On top of "just working" (as demonstrated above), I feel that chrony's feature set is so much superior... Here's an excerpt of the extras in chrony, taken from the comparison table:
  • source frequency tracking
  • source state restore from file
  • temperature compensation
  • ready for next NTP era (year 2036)
  • replace unreachable / falseticker servers
  • aware of jitter
  • RTC drift tracking
  • RTC trimming
  • Restore time from file w/o RTC
  • leap seconds correction, in slew mode
  • drops root privileges
I even understand some of that stuff. I think. So kudos to the chrony folks, I'm switching.

Caveats One thing to keep in mind in the above, however is that it's quite possible chrony does as bad of a job as openntpd on that old machine, and just doesn't tell me about it. For example, here's another log sample from another server (marcos):
jan 23 11:13:25 marcos ntpd[1976694]: adjusting clock frequency by 0.451035 to -16.420273ppm
I get those basically every day, which seems to show that it's at least trying to keep track of the hardware clock. In other words, it's quite possible I have no idea what I'm talking about and you definitely need to take this article with a grain of salt. I'm not an NTP expert. Update: I should also mentioned that I haven't evaluated systemd-timesyncd, for a few reasons:
  1. I have enough things running under systemd
  2. I wasn't aware of it when I started writing this
  3. I couldn't find good documentation on it... later I found the above manpage and of course the Arch Wiki but that is very minimal
  4. therefore I can't tell how it compares with chrony or (open)ntpd, so I don't see an enticing reason to switch
It has a few things going for it though:
  • it's likely shipped with your distribution already
  • it drops privileges (possibly like chrony, unclear if it also has seccomp filters)
  • it's minimalist: it only does SNTP so not the server side
  • the status command is good enough that you can tell the clock frequency, precision, and so on (especially when compared to openntpd's ntpctl)
So I'm reserving judgement over it, but I'd certainly note that I'm always a little weary in trusting systemd daemons with the network, and would prefer to keep that attack surface to a minimum. Diversity is a good thing, in general, so I'll keep chrony for now. It would certainly nice to see it added to chrony's comparison table.

Switching to chrony Because the default configuration in chrony (at least as shipped in Debian) is sane (good default peers, no open network by default), installing it is as simple as:
apt install chrony
And because it somehow conflicts with openntpd, that also takes care of removing that cruft as well.

Update: Debian defaults So it seems like I managed to write this entire blog post without putting it in relation with the original reason I had to think about this in the first place, which is odd and should be corrected. This conversation came about on an IRC channel that mentioned that the ntp package (and upstream) is in bad shape in Debian. In that discussion, chrony and ntpsec were discussed as possible replacements, but when we had the discussion on chat, I mentioned I was using openntpd, and promptly realized I was actually unhappy with it. A friend suggested chrony, I tried it, and it worked amazingly, I switched, wrote this blog post, end of story. Except today (2022-02-07, two weeks later), I actually read that thread and realized that something happened in Debian I wasn't actually aware of. In bookworm, systemd-timesyncd was not only shipped, but it was installed by default, as it was marked as a hard dependency of systemd. That was "fixed" in systemd-247.9-2 (see bug 986651), but only by making the dependency a Recommends and marking it as Priority: important. So in effect, systemd-timesyncd became the default NTP daemon in Debian in bookworm, which I find somewhat surprising. timesyncd has many things going for it (as mentioned above), but I do find it a bit annoying that systemd is replacing all those utilities in such a way. I also wonder what is going to happen on upgrades. This is all a little frustrating too because there is no good comparison between the other NTP daemons and timesyncd anywhere. The chrony comparison table doesn't mention it, and an audit by the Core Infrastructure Initiative from 2017 doesn't mention it either, even though timesyncd was announced in 2014. (Same with this blog post from Facebook.)

16 January 2022

Russell Coker: SSD Endurance

I previously wrote about the issue of swap potentially breaking SSD [1]. My conclusion was that swap wouldn t be a problem as no normally operating systems that I run had swap using any significant fraction of total disk writes. In that post the most writes I could see was 128GB written per day on a 120G Intel SSD (writing the entire device once a day). My post about swap and SSD was based on the assumption that you could get many thousands of writes to the entire device which was incorrect. Here s a background on the terminology from WD [2]. So in the case of the 120G Intel SSD I was doing over 1 DWPD (Drive Writes Per Day) which is in the middle of the range of SSD capability, Intel doesn t specify the DWPD or TBW (Tera Bytes Written) for that device. The most expensive and high end NVMe device sold by my local computer store is the Samsung 980 Pro which has a warranty of 150TBW for the 250G device and 600TBW for the 1TB device [3]. That means that the system which used to have an Intel SSD would have exceeded the warranty in 3 years if it had a 250G device. My current workstation has been up for just over 7 days and has averaged 110GB written per day. It has some light VM use and the occasional kernel compile, a fairly typical developer workstation. It s storage is 2*Crucial 1TB NVMe devices in a BTRFS RAID-1, the NVMe devices are the old series of Crucial ones and are rated for 200TBW which means that they can be expected to last for 5 years under the current load. This isn t a real problem for me as the performance of those devices is lower than I hoped for so I will buy faster ones before they are 5yo anyway. My home server (and my wife s workstation) is averaging 325GB per day on the SSDs used for the RAID-1 BTRFS filesystem for root and for most data that is written much (including VMs). The SSDs are 500G Samsung 850 EVOs [4] which are rated at 150TBW which means just over a year of expected lifetime. The SSDs are much more than a year old, I think Samsung stopped selling them more than a year ago. Between the 2 SSDs SMART reports 18 uncorrectable errors and btrfs device stats reports 55 errors on one of them. I m not about to immediately replace them, but it appears that they are well past their prime. The server which runs my blog (among many other things) is averaging over 1TB written per day. It currently has a RAID-1 of hard drives for all storage but it s previous incarnation (which probably had about the same amount of writes) had a RAID-1 of enterprise SSDs for the most written data. After a few years of running like that (and some time running with someone else s load before it) the SSDs became extremely slow (sustained writes of 15MB/s) and started getting errors. So that s a pair of SSDs that were burned out. Conclusion The amounts of data being written are steadily increasing. Recent machines with more RAM can decrease storage usage in some situations but that doesn t compare to the increased use of checksummed and logged filesystems, VMs, databases for local storage, and other things that multiply writes. The amount of writes allowed under warranty isn t increasing much and there are new technologies for larger SSD storage that decrease the DWPD rating of the underlying hardware. For the systems I own it seems that they are all going to exceed the rated TBW for the SSDs before I have other reasons to replace them, and they aren t particularly high usage systems. A mail server for a large number of users would hit it much earlier. RAID of SSDs is a really good thing. Replacement of SSDs is something that should be planned for and a way of swapping SSDs to less important uses is also good (my parents have some SSDs that are too small for my current use but which work well for them). Another thing to consider is that if you have a server with spare drive bays you could put some extra SSDs in to spread the wear among a larger RAID-10 array. Instead of having a 2*SSD BTRFS RAID-1 for a server you could have 6*SSD to get a 3* longer lifetime than a regular RAID-1 before the SSDs wear out (BTRFS supports this sort of thing). Based on these calculations and the small number of errors I ve seen on my home server I ll add a 480G SSD I have lying around to the array to spread the load and keep it running for a while longer.

Next.

Previous.